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Abstract 

We consider the problem of high-dimensional non-linear variable selection for supervised 
learning. Our approach is based on performing linear selection among exponentially many ap- 
propriately defined positive definite kernels that characterize non-linear interactions between 
the original variables. To select efficiently from these many kernels, we use the natural hierar- 
chical structure of the problem to extend the multiple kernel learning framework to kernels that 
can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel 
selection through a graph-adapted sparsity-inducing norm, in polynomial time in the number of 
selected kernels. Moreover, we study the consistency of variable selection in high-dimensional 
settings, showing that under certain assumptions, our regularization framework allows a num- 
ber of irrelevant variables which is exponential in the number of observations. Our simulations 
on synthetic datasets and datasets from the UCI repository show state-of-the-art predictive per- 
formance for non-linear regression problems. 



1 Introduction 



High-dimensional problems represent a recent and important topic in machine learning, statistics 
and signal processing. In such settings, some notion of sparsity is a fruitful way of avoiding over- 
fitting, for example through variable or feature selection. This has led to many algorithmic and 
theoretical advances. In particular-, regularization by sparsity-inducing norms such as the £i-norm 
has attracted a lot of interest in recent years. While early work has focused on efficient algo- 
rithms to solve the convex optimization problems, recent research has looked at the model selec- 
tion propertie s and predictive pe r formance of such methods, in the linear case dZhao and YuLbood 
Yuan and Linl . l2007l : Eoul. l2006l : IWainwrightl . l2009l : iBickel et all. \2004 : Izhanel. l2009ah or within 
constrained non-linear settings such as the mult i ple kernel learning framework (ILanckriet et al 



2004bl : ISrebro and Ben-Davidf . l2006l :lBach'.'2008a':'Koltchinskii and YuanLboOSlilYing and Campbell 
2009 ) or generalized additive models (iRavikumar et al... ,2008. : iLin and ZhangL 20061) . 
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However, most of the recent work dealt with linear high-dimensional variable selection, while 
the focus of much of the earlier work in machine learning and statistics was on non-linear low- 
dimensional problems: indeed, in the last two decades, kernel methods have been a prolific theoret- 
ical and algorithmic machine learning framework. By using appropriate regulaiization by Hilber- 
tian norms, representer theorems enable to consider large and potentially infinite-dimensional fea- 
ture spaces while working within an implicit feature space no larger than the number of obser- 
vations. This has led to numerous works on kernel design adapte d to specific data types and 
generic kernel-based algorithms for many learning tasks (see, e.g., ISchoLkopf and Smolal . l2002l : 



Shawe-Taylor and Cristianinil . 120041 '). However, while non-linearity is required in many domains 



such as computer vision or bioinformatics, most theoretical results related to non-parametric meth- 
ods do not scale well with input dimensions. In this paper, our goal is to bridge the gap between 
linear and non-linear methods, by tackling high-dimensional non-linear problems. 

The task of non-linear variable section is a hard problem with few approaches that have both 
good theoretical and algorithmic properties, in particular- in high-dimensional settings. Among 
classical m ethods, some are imp l icitly or explicitly based on sparsity and mod el selection, suc h 



as boosting (IFreund and Schapird.ll997h . multivariat e additive regres sion spli nes (lFriedmanl.ll99lh . 



decision trees ( Breiman et al.l 1984 ), random f orests ( Breiman 2001 ), Cosso ( Lin and Zhang , 2006 ) 



or Gaussian process based methods (see, e.g., Rasmussen and Williamsl. 2006), while some others 



do not rely on sparsity, such as neare st neighbors or kernel methods (see, e.g., 
Shawe-Tavlor and Cristianinil . l2004l \ 



Devrove et al. 



199^ : 



First attempts were made to combine non-linearity and sparsity-inducing norms by considering 
generalized additive models, where the predict or function i s assumed to be a sparse linear combi 



nation of non-linear functions of each variable (IBach et al.l l2004al : iBachl 12008 al : iRavikumar et al 



20081) . However, as shown in Section 15.31 higher orders of interactions are needed for universal 
consistency, i.e., to adapt to the potential high complexity of the interactions between the relevant 
variables; we need to potentially allow 2^ of them for p variables (for all possible subsets of the p 
variables). Theoretical results suggest that with appropriate assumptions, sparse methods such as 
greedy methods and methods based on the ^i-norm would be able to deal correctly with 2^ features 
if jp is of the o rder of the number of observations n dWainwrightl . l2009l : ICandes and Wakinl. l2008l : 



Zhangl.l2009bl) . However, in presence of more than a few dozen variables, in order to deal with that 



many features, or even to simply enumerate those, a certain form of factorization or recursivity is 
needed. In this paper, we propose to use a hierarchical structure based on directed acyclic graphs, 
which is natural in our context of non-lineai" variable selection. 

We consider a positive definite kernel that can be expressed as a large sum of positive defi- 
nite basis or local kernels. This exactly con^esponds to the situation where a large feature space is 
the concatenation of smaller feature spaces, and we aim to do selection among these many kernels 



(or equ ivalently feature spaces), which may be done through multiple kernel learning (|Bach et al 



2004al) . One major difficulty however is that the number of these smaller kernels is usually expo- 



nential in the dimension of the input space and applying multiple kernel learning directly to this 
decomposition would be intractable. As shown in Section ll!2l for non-linear variable selection, we 
consider a sum of kernels which are indexed by the set of subsets of all considered variables, or 
more generally by {0, ... , qY\ for g ^ 1. 

In order to perform selection efficiently, we make the extra ass umption that these small kernels 
can be embedded in a directed acyclic graph (DAG). Following IZhao et al. (l2009t) . we consider 
in Section 12 a specific combination of ^2-norms that is adapted to the DAG, and that will restrict 
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the authorized sparsity patterns to certain configurations; in our specific kernel-based framework, 
we are able to use the DAG to design an optimization algorithm which has polynomial complexity 
in the number of selected kernels (Section |4ll. In simulations (Section [6l), we focus on directed 
grids, where our framework allows to perform non-lineai" variable selection. We provide some 
experimental validation of our novel regularization framework; in particular, we compare it to the 
regular £2 -regularization, greedy forward selection and non-kernel-based methods, and shows that 
it is always competitive and often leads to better performance, both on synthetic examples, and 
standard regression datasets from the UCI repository. 

Finally, we extend in Sect i on [5] some of the known consistency results of the Lasso and multiple 
kernel learning (|Zhao and YuL 120061 : lBachLl2008a) . and give a partial answer to the model selection 
capabilities of our regularization framework by giving necessary and sufficient conditions for model 
consistency. In particular, we show that our framework is adapted to estimating consistently only 
the hull of the relevant variables. Hence, by restricting the statistical power of our method, we gain 
computational efficiency. Moreover, we show that we can obtain scalings bet ween the number of 



variables and the number of observations wh i ch are similar to the linea r case ( Wainwright , 



Wainwright . 



2009; 



20091 : 



Candes an d Wakin. '2008*: 'Zhao and Yul. I2OO6I : l^ian and LinL lioOTi : Eoui E0O6I : ^ 

Bickel et al.,. ,2009. : ,Zhana . ilOO^JT indeed, we show that our regularization framework may achieve 
non-linear variable selection consistency even with a number of variables p which is exponential in 
the number of observations n. Since we deal with 2^ kernels, we achieve consistency with a number 
of kernels which is doubly exponential in n. Moreover, for general directed acyclic graphs, we show 
that the total number of vertices may grow unbounded as long as the maximal out-degree (number 
of children) in the DAG is less than exponential in the number of observations. 

This paper extends previous work ( BachL 2008b), by providing more background on multiple 
kernel learning, detailing all proofs, providing new consistency results in high dimension, and com- 
paring our non-linear predictors with non-kemel-based methods. 



Notation. Throughout the paper we consider Hilbertian norms ||/|| for elements / of Hilbert 
spaces, where the specific Hilbert space can always be inferred from the context (unless otherwise 
stated). For rectangular matrices A, we denote by ||j4||op its largest singular value. We denote by 
Amax(Q) and Amin(Q) the largest and smallest eig envalue of a symmetric matri x Q. These are 
naturally extended to compact self-adjoint operators ( Brezis . 1980l : Conway . 1997 ). 



Moreover, given a vector v in the product space J^i x ■ ■ ■ x Tp and a subset / of {1, . . . 
vj denotes the vector in of elements of v indexed by /. Similarly, for a matrix A defined 

with p X p blocks adapted to J^i, . . . ,J^p, Ajj denotes the submatrix of A composed of blocks 
of A whose rows are in / and columns are in J. Moreover, \J\ denotes the cardinal of the set J 
and 1^1 denotes the dimension of the Hilbert space JF. We denote by 1„ the n-dimensional vector 
of ones. We denote by (a)+ = max{0,a} the positive part of a real number a. Besides, given 
matrices Ai, . . . , An, and a subset / of {1, ... , n}, Diag(A)/ denotes the block-diagonal matrix 
composed of the blocks indexed by /. Finally, we let denote F and E general probability measures 
and expectations. 
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Loss Lpi{Ui) 


Fenchel conjugate ipil^Pi) 


Least-squares regression 


^ (o, 1/ \^ 

2\yi — Ui) 


2Pi + Piyi 


1-norm support 


ilVi -Ui\ 




vecior regression (^o vkj 




-pOO oinerwise 


2 -norm support 


2{\yi - Ui\ - ^) + 


2Pi + Piyi + \Pi\^ 


vecioi regression viv^ 






Hiiber regression 


~n V2 -f i 


2Pi + Piyi 11 \Pi\ ^ £ 




2 

e\yi — Ui\ — ^ otherwise 


+00 Otherwise 


Logistic regression 


log(l + exp{-yiUi)) 


{l+PiVi) \og{l+ (iiVi) -Piyi logi-PiVi) 
if Piyi £ [— 1)0]> +00 otherwise 


1-norm support 


max(0, 1 - ViUi) 


yiPiifpiVr G [-1,0] 


vector machine (SVM) 




+00 otherwise 


2-norm support 


\ max(0, 1 - ViUiY 


^Pf + P^yi if Piy^^O 


vector machine (SVM) 




+00 otherwise 



Table 1: Loss functions with corresponding Fenchel conjugates, for regression (first three losses, 
yi G M) and binary classification (last three losses, yi E { — 1, 1}- 



2 Review of Multiple Kernel Learning 

We consider the problem a predicting a response y G R from a variable X ^ X, where X may 
be any set of inputs, refen^ed to as the input space. In this section, we review the multiple kernel 
learning framework our paper relies on. 



2.1 Loss Functions 

We assume that we are given n observations of the couple {X,Y), i.e., {xi,yi) G x 3^ for 
i = 1, . . . , n. We define the empirical risk of a function / from A' to M as 



1 " 

-^£{y,J(x,)), 



i=l 



where ^ : M x M is a loss function. We only assume that £ is convex with respect to the 

second param eter (but not necess arily differentiable). 

Following iBach et al.l (l2004bl) and lSonnenburg et al.l (120061) . in order to derive optimality condi- 
tions for all losses, we need to introduce Fe nchel conjugates (see examples i n Table[T]and Figure[Tl). 
Let ^/^j : M M, be the Fenchel conjugate ( Boyd and Vandenberghe . 20031) of the convex function 
ifi : Ui^ i{yi, Ui), defined as 

ipiiPi) = max UiPi - Lpi{ui) = max mPi - i{yi,Ui). 

The function tpi is always convex and, because we have assumed that ipi is convex, we can represent 
ifi as the Fenchel conjugate of ^j, i.e., for all Ui E M, 



£{yi,Ui) = (pi{ui) = max mPi - ipiiPi). 
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Moreover, in order to include an unregularized constant term, we will need to be able to solve with 
respect to 6 G M the following optimization problem: 



mm 



1 " 

-y^^ipi{ui + 5). 



(1) 



i=l 



For n G M", we let denote by b* (u) any solution of Eq. ([U. It can either be obtained in closed form 
(least-squares regression), using Newton-Raphson (logistic regression), or by ordering the values 
Ui £ W, i = 1, . . . ,n (all other piecewise quadratic losses). In Section HI we study in details losses 
for which the Fenchel conjugate ipi is strictly convex, such as for logistic regression, 2-norm SVM, 
2-norm SVR and least-squaies regression. 



2.2 Single Kernel Learning Problem 

In this section, we assume that we are given a positive definite kernel k{x, x') on X. We can then 
define a reproducing kernel Hilbert space (RKHS) as the comple tion of the linear span of functions 
X H-> k{x,x') for x' £ X (IBerlinet and Thomas-Agnanl . |2003|) . We can define the, feature map 
<^ : X ^ such that for all x £ X, f{x) = (/, and for all x, x' € X, = k{x, x'); 

we denote by ||/|| the norm of the function f ^ T . We consider the single kernel learning problem: 



1 " A 



(2) 



The followin g proposition gives its dual, providing a convex instance of the representer theo- 
rem (see, e.g. IShawe-Taylor and Cristianinil . |2004j; ISchoUcopf and Smolal. |2002| . and proof in Ap- 
pendix |Ai2]): 

Proposition 1 (Dual problem for single kernel learning problem) The dual of the optimization 
problem in Eq. ([2]) is 

1 " A 

max %l}i{—n\ai) oi^ Ka, (3) 

qgM", iTq=o n ^ 2 
i=i 
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where K € M"^" is the kernel matrix defined as Kij = k{xi, xj). The unique primal solution f 
can be found from an optimal a as f = Y17=i ^i^{^i)> '^^^ ^ — b*{Ka). 

Note that if the Fenchel conjugate is strictly convex or if the kernel matrix is invertible, then the dual 
solution a is also unique. In Eq. the kernel matrix K may be replaced by its centered version 

K=il-\lnll)K{l-^lr.ll), 



defined as the kernel matrix of the c entered observed features (see, e.g. lShawe-Taylor and Cristianini 
20041 : ISchoUcopf and Smolal . l2002f) . Indeed, we have Ka = Ka in Eq. ([3]); however, in the 



definition of 6 = h*{Ka), K cannot be replaced by K. 

Finally, the duality gap obtained from a vector a G M" such that l^a = 0, and the associated 
primal candidates from Proposition [T] is equal to 

n 1 " 

gaPkcrnci {K, a) = - S^ ipi [{Ka)i + b*{Ka)] + Xa^Ka + - il^ii-nXai). (4) 

i=l 2=1 

2.3 Sparse Learning with Multiple Kernels 

We now assume that we are given p different reproducing kernel Hilbert spaces Tj on X, associated 
with positive definite kernels kj : X ^ X ^ j = 1, . . . , p, and associated fe ature maps : 
X ^ Tj. We consider generalized additive models ( Hastie and Tibshirani . 1990h . i.e., predictors 



parameterized by / = (/i, . . . , /p) G ^ = x • • • x of the form 

V V 

fix) + b = ^fjix) + b = ^{f,,<fjix)) + 6, 

=1 



where each fj G JFj and 6 G M is a constant term. We let denote ||/|| the Hilbertian norm of / G 
.Fi X • • • X J-p, defined as = ll/if ■ 

We consider regularizing by the sum of the Hilbertian norms, J2^=i 1 1/? 1 1 (which is not itself 
a Hilbertian norm), with the intuition that this norm will push some of the functions fj towards 
zero, and thus provide data-dependent selection of the feature spaces Tj, j = 1, . . . ,p, and hence 
selection of the kernels kj, j = 1, . . . ,p. We thus consider the following optimization problem: 



">■" r.^ ^ Eiy" E /i(-<) + "j + 2 ( E 11/^11- <s 

i=i ^ j=i ^ ^j=i 



Note that using the squared sum of norms does not change the regularization properties: for all 
solutions of the problem regularized by ^^^^^ ll/j ll' there corresponds a sol ution of the problern 



in Eq . ^ with a different regularization parameter, and vice- versa (see, e.g.. iBorwein and Lewis . 



i 



Section 3.2). The previous formulation encompasses a variety of situations, depending on 
how we set up the input spaces Xi, . . . ,Xp: 

• Regular ^i-norm and group £i-norm regularization: if each Xj is the space of real num- 
bers, t hen we exactly g et back penalization by the £i-norm, and for the square loss, the 



Lasso (iTibshiranil 11996) : if we consider finite dimensional vector sp aces, we get back th e 



block ^i-norm formulation and the group Lasso for the square loss ( Yuan and LinL 20061) . 



Our general Hilbert space formulation can thus be seen as a "non-parametric group Lasso". 
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"Multiple input space, multiple feature spaces": In this section, we assume that we have a 
single input space A! and multiple feature spaces J^i, . . . ,J^p defined on the same input space. 
We could also consider that we have p different input spaces and one feature space J^j per 
Xj, j = 1, . . . ,p, a situation common in generalized additive models. We can go from the 
"single input space, multiple feature spaces" view to the "multiple input space/feature space 
pairs" view by considering p identical copies Xi, . . . ,Xp or while we can go in the other 
direction using projections from X = Xi x ■ ■ ■ x Xp. 



The sparsity-inducing norm formulation defined in Eq. ^ can be seen from several points of views 
and this has led to interesting algorithmic and theoretical developments, which we review in the 
next sections. In this paper, we will build on the approach of Section 1241 but all results could be 
derived through the approach presented in Section 1231 and Section l26l 



2.4 Learning convex combinations of kernels 



Pontil and Micchellil (120051) and lRakotomamonjy et al.l (120081) show that 



Ell/. I 



mill > 

1JC=1^ 



11/^, 



where the minimum is attained at Q = \\fj\\/Ylt=i \\fk\\- This variational formulation of the 
squai^ed sum of norms allows to find an equivalent problem to Eq. namely: 



min mill —'^^V^,'^fi(xi)+b\-\ — 



i=i j=i 



2 ^ Ci 

j=i ''J 



(6) 



Given C G such that iJC = 1, 

using the change of variable fj = fjC,- and ^j{x) 

1 /2 

Q- ^j{x), j = 1, . . . ,p, the problem in Eq. ^ is equivalent to: 



1 " ~ ~ A 

min min - V ^(i/i, (/, + 6) + - 



with respect to /. Thus / is the solution of the single kernel learning problem with kernel 



j=l j=l 



This shows that the non-parametri c group Lasso forrnulation amounts in fact t o lear ning implicitly a 
weighted combination of kernels (iBach et al. I l2004al : iRakotomamonjy et al.U2008r) . Moreover, the 
optimal functions fj can then be computed as fj{-) = Q 'Y^^=i cxikj{-, Xi), where the vector a G M" 
is common to all feature spaces J^j, j = 1, . . . ,p. 
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2.5 Conic convex duality 

One can also consid er the convex optimization problem in Eg. (Bl) and d erive the convex dual using 
conic programming dLobo et all 1 19981 : iBach et all. l2004al : iBachl l2008ah : 



f 1 

max <^ y]^*( 

" ^ 1=1 



-nXaj 



A y~ 
— max a Kja 
2 je{i,...,p} 



(V) 



where Kj is the centered kernel matrix associated with the j-th kernel. From the optimality condi- 
tions for second order cones, one can also get that there exists positive weights ( that sum to one, 
such that fj{-) = (j Y17=i (^i^ji'j^i) (see iBach et all l2004al . for details). Thus, both the kernel 
weights and the solution a of the correspond learning problem can be derived from the solution of 
a single convex optimization problem based on second-order cones. Note that this formulation may 
be actually solved for small n with general-purpo se toolboxes for second -order cone programming, 
although QCQP approaches may be used as well (ILanckriet et al.Ll2004ah . 



2.6 Kernel Learning with Semi-definite Programming 

There is another way of seeing the same problem. Indeed, the dual problem in Eq. (|7]l may be 
rewritten as follows: 



max 

oGM", 1Jo=0 



min / ihi(—n\ai) ( "S^ CiKi\a\ , 



(8) 



and by convex duality (IBoyd and Vandenberghel . l2003l : lRockafellaij . ll970f) as: 



min max < ihA—nXaj) ( CiKi\a\ 



(9) 



If we denote G{K) = maxQ,gig,i it^^^q {~n ^l=i ~ ^oi^ Ka^ , the optimal value of 



the single kernel learning problem in Eq. Q with loss I and kernel matrix K (and centered kernel 
matrix K), then the multiple kernel leaimng problem is equivalent to minimizing G{K) over convex 
combinations of the p kernel matrices associated with all p kernels, i.e., equivalent to minimizing 

s(C) = G(E^=iOi^,). 



Th is function G(K), introduced by severa l authors in slightly different contexts (ILanckriet et al. 



2004bl : iPontil and Micchellill2005l : IOng et al.U2005l) leads to a more general kernel learning frame 
work where one can learn more than simply convex combinations of kernels — in fact, any ker- 
nel matrix which is positive semi-definite. In terms of theoretical analysis, results from gen- 
eral kernel classes rnay be brought to bear (ILanckriet et al.l l2004bl : ISrebro and Ben-DavidL l2006l : 
Ying and Campbeli l2009h : however, the special case of convex combination allows the sparsit y 
interpretation and some additional theoretical analysis (lBachll2008al : lKoltchinskii and Yuanl. 120081) . 
The practical and theoretical advantages of allowing more general potentially non convex combina- 
tions (not necessarily w ith positive coefficients) of kernels is still an open problem and subject of 
ongoing work (see, e.g.. lVarma and Babull2009l. and references therein). 



Note that regularizing in Eq. ^ by the sum of squared norms Yl^=i (instead of the 



squared sum of norms), is equivalent to considering the sum of kernels matrices, i.e., K = Y^^j= 



1 Kj. 
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Moreover, if all kernel matrices have rank one, then the kernel learning problem is equivalent to 
an £-\ -norrn problem, for which dedica ted algorithms are usually much more efficient (see, e.g., 
Efron et all l2004l : IWu and Langl. l2008h . 



2.7 Algorithms 



The multiple facets of the multiple kernel learning problem have led to multiple algorithms. The 
first ones were based on the minimizati on of B(C) = G(y^.^-^CjKj) through general-purpose 



toolboxes for semidefinite programming ( Lanckriet et al. . 2004bl : Qng et al. . 2005 ). While this al- 
lows to get a solution with high precision, it is not scalable to mediur n and large - scale p roblems. 
Later, approaches based on conic duality and smoothing were derived ( Bach et al. , 2004al lbl). They 
were based on existing efficient techniques for the support vector machine (S VM) or pote ntially 



other supervised learning problems, namely sequential minimal optimization (|Plati Il998h . Al- 
though they are by design scalable, they require to recode existing learning algorithms and do 
not reuse pre-existing implementations. The latest formulations based on the direct minimiza- 



tion of a cost function that d e pends directly on ( allow to reuse existing code (|Sonnenburg et al 



20061 : iRakotomamonjy et al.l. 120081) and may thus benefit from the intensive optimizations and 
tweaks alre ady carried through. Finally, active set method s have been recently considered for fi- 
nite groups ( Roth and Fischej. 20081 : Obozinski et al. , 20091) . an approach we extend to hierarchical 
kernel learning in Section l44l 



3 Hierarchical Kernel Learning (HKL) 

We now extend the multiple kernel learning framework to kernels which are indexed by vertices in a 
directed acyclic graph. We first describe examples of such graph-structured positive definite kernels 
from Section [3TT] to Section l34l and defined the graph-adapted norm in Section |33] 



3.1 Graph- Structured Positive Definite Kernels 

We assume that we are given a positive definite kernel A; : x A" — > M, and that this kernel can be 
expressed as the sum, over an index set V, of basis kernels k^, v , i.e., for all x, x' G X: 

}xi{^X ^ X ^ — ^ Aj-y {^X X ^ . 
v&V 

For each S F, we denote by J^^ and the feature space and feature map of k^, i.e., for all 

x,x' e X, ky{x,x') = 

Our sum assumption corresponds to a situation where the feature map <^{x) and feature space 
for k are the concatenations of the feature maps ^v{x) and feature spaces !Fy for each kernel k^, 
i.e., T = Wy^y J^v and <I>(x) = (<^„(x))t,gv'- Thus, looking for a certain f ^ T and a predictor 
function /(x) = (/, ^{x)) is equivalent to looking jointly for /„ G for all v , and 

/(rE) = (/,cI>(x)) = J^(/,,cI>„(x)). 

As mentioned earlier, we make the assumption that the set V can be embedded into a directed 
acyclic grap^ Directed acyclic graphs (referred to as DAGs) allow to naturally define the notions 

^Throughout this paper, for simplicity, we use the same notation to refer to the graph and its set of vertices. 
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of parents, children, descendants and ancestors (IDiestell . |2005|). Given a node w £ V, we denote 
by A{w) C V the set of its ancestors, and by D{w) C V, the set of its descendants. We use the 
convention that any u; is a descendant and an ancestor of itself, i.e., w € A{w) and w G D{w). 
Moreover, for W C V, we let denote sources(l^) the set of sources (or roots) of the graph V 
restricted to W, that is, nodes in W with no parents belonging to W. 

Moreover, given a subset of nodes W C V, we can define the hull of W as the union of all 
ancestors of w G W, i.e., 

hull(VF) = (j A{w). 

Given a set W, we define the set of extreme points (or sinks) of W as the smallest subset T C W 
such that hull(T) = hull(l^); it is always well defined, as (see Figure |2] for examples of these 
notions): 

sinks(VF) = Pi T. 

TCV, hull(T)=hull(Vl/) 

The goal of this paper is to perform kernel selection among the kernels k^, v V. We essentially 
use the graph to limit the search to specific subsets of V. Namely, instead of considering all possible 
subsets of active (relevant) vertices, we will consider active sets of vertices which are equal to their 
hulls, i.e., subsets that contain the ancestors of all their elements, thus limiting the search space (see 
Section [33] ). 



3.2 Decomposition of Usual Kernels in Directed Grids 

In this paper, we primarily focus on kernels that can be expressed as "products of sums", and on 
the associated p-dimensional directed grids, while noting that our framework is applicable to many 
other kernels (see, e.g.. Figure |4l). Namely, we assume that the input space X factorizes into p 
components X = x ■ ■ ■ x Xp and that we are given p sequences of length g + 1 of kernels 
kij{xi, x[), i G {1, . . . j G {0, . . . , q}, such that (note the implicit different conventions for 
indices in ki and kij): 

p p / q \ q p 

k{x,x') = '[[ki{xi,x'i} = Y[i'^kij{xi,x'i)j= ^ ]J%(xi,x-). (10) 

i=l i=l ^ j=0 ^ ji,...,jp=0 i=l 

Note that in this section and the next section, Xj refers to the z-th component of the tuple x = 
(xi, . . . , Xp) (while in the rest of the paper, Xj is the i-th observation, which is itself a tuple). We 
thus have a sum of {q + 1)^ kernels, that can be computed efficiently as a product of p sums of 
q + I kernels. A natural DAG on y = {0, . . . , q}^ is defined by connecting each (ji, . . . ,jp) 
respectively to {ji + 1, j2, • • • , ip), • • • , (ji, • • • , jp-i,jp + 1) as long as ji < q,..., jp < q, re- 
spectively . As shown in Section 13.51 this DAG (which has a single source) wiU correspond to 
the constraint of selecting a given product of kernels only after all the subproducts are selected. 
Those DAGs are especially suited to non-linear variable selection, in particular with the polyno- 
mial, Gaussian and spline kernels. In this context, products of kernels correspond to interactions 
between certain variables, and our DAG constraint implies that we select an interaction only after 
all sub-interacti ons were a l ready selected, a constraint that is similar to the one used in multivariate 



additive splines (IFriedmanl. 1 1 99 ih . 
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Figure 2: Examples of directed acyclic graphs (DAGs) and associated notions: (top left) 2D-grid 
(number of input variables p = 2, maximal order in each dimension q = 4); (top right) example of 
spai^sity pattern which is not equal to its hull ( x in light blue) and (bottom left) its hull ( x in light 
blue); (bottom right) dark blue points (x) are extreme points of the set of all active points (blue x); 
dark red points (+) are the sources of the complement of the hull (set of all red +). Best seen in 
color. 
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Figure 3: Directed acyclic graph of subsets of size 4: (left) DAG of subsets {p = 4, q = 1); (right) 
example of sparsity pattern (light and dark blue), dark blue points are extreme points of the set of 
all active points; dai^k red points aie the sources of the set of all red points. Best seen in color. 



Polynomial kernels. We consider Xi = M, = (1 + XjX^"^ and for all j G {0, . . . , q}, 

kij{xi, x[) = (p {xix'^y-, the full kernel is then equal to 

k{x,x')=\{{l + X,xy= \{{'^){x^x[r. 

i=l ji,-Jp=0 i=l ^-^'^ 

Note that this is not exactly the usual polynomial kernel {I + x'^ x')'^ (whose feature space is the 
space of multivariate polynomials of total degree less than q), since our kernel considers polynomi- 
als of maximal degree q. 



Gaussian kernels (Gauss-Hermite decomposition). We also consider Xi = M, and the Gaussian- 
RBF kernel e~^^^'~^^^ with 6 > 0. The following decomposition is the eigendecomposition of the 
non centered covariance operator correspondi ng to a normal distribution with variance l/4a (see, 
e.g. JwiUiams and Seegei].(2000l:lBachl . l2008ah : 



'l-^Y''f,^^e~'ki^^'^)^iH,{V2^^^^ (11) 
/ i=o ^' 



where c? = + 2ab, A = a + b + c, and Hj is the j-th Hermite polynomial ( Szegol . 1981 ). By 
appropriately truncating the sum, i.e., by considering that the first q basis kernels are obtained from 
the first q Hermite polynomials, and the {q + l)-th kernel is summing over all other kernels, we 
obtain a decomposition of a uni-dimensional Gaussian kernel into q + I components (the first q of 
them are one-dimensional, the last one is infinite-dimensional, but can be computed by differenc- 
ing). The decomp osition ends up being close to a poly nomial kernel of infinite degree, modulated 
by an exponential (|Shawe-Tavlor and Cristianinil.l2004r). One may al s o use an adaptive decomposi - 
tion using kernel PC A (see, e.g. Ishawe-Taylor and Cristianini . 2004 : SchoUcopf and Smola . 2002), 
which is equivalent to using the eigenvectors of the empirical covariance operator associated with 
the data (and not the popula tion one associated with the Gaussian distribution with same variance). 
In prior work ( Bachl. 2008b), we tried both with no significant differences. 



All-subset Gaussian kernels. When q = 1, the directed grid is isomorphic to the power set (i.e., 
the set of subsets, see Figure |3]l with the DAG defined as the Hasse diagram of the partially ordered 
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Figure 4: Additional examples of discrete structures. Left: pyramid over an image; a region is se- 
lected only after all larger regions that co ntains it are selected. Ri ght: set of substrings o f size 3 from 



the a lphabet {A,B}; in bioinformatics (|Scholkopf et al.L 12004) and text processing (jLodMetal 



20021) . occurence of certain potentially long strings is an important feature and considering the 
structure may help selecting among the many possible strings. 



set of all subset s (ICameronl. 11994ft. In this setting , we can decompose the all-subset Gaussian 
kernel (see, e.g.. lShawe-Taylor and Cristianinil . 120041 ') as: 



11(1 + 



ae 



E 

Jc{i,.--,p}*eJ 



n 



ae 



E 

JC{1,... 



■b\\x.-i 



x',r 



p} 



and our framework will select the relevant subsets for t he Gaussian kernels, w ith the DAG presented 
in Figure [3l A similar decomposition is considered by Lin and Zhang ( 20061) . but only on a subset 
of the power set. Note that the DAG of subsets is different frorn the "kernel graphs" introduced for 
the same type of kernel by IShawe-Taylor and Cristianinil (|2004h for expliciting the computation of 
polynomial kernels and ANOVA kernels. 



Kernels on structured data. Although we mainly focus on directed grids in this paper, many 
kernels on structured data can also be naturally d ecomposed through a hierarchy (see FigurelH), such 



as the pyramid match kernel and related kernels (iGrauman and Darrelj 120071: ICuturi and Fukumizu 



2006h . string kernels or graph kernels (see, e.g., IShawe-Taylor and Cristianinil . l2004l) . The main 
advantage of using -norms inside the feature space, is that the method will adapt the complexity 
to the problem, by only selecting the right order of complexity from exponentially many features. 



3.3 Designing New Decomposed Kernels 

As shown in Section [51 the problem is well-behaved numerically and statistically if there is not too 
much correlation between the various feature maps u G y. Thus, kernels such as the the all- 
subset Gaussian kernels may not be appropriate as each feature space contains the feature spaces 
of its ancestor^ Note that a strategy we could follow would be to remove some contributions of 
all ancestors by appropriate orthogonal projections. We now design specific kernels for which the 
feature space of each node is orthogonal to the feature spaces of its ancestors (for well-defined dot 
products). 

^More precisely, this is true for the closures of these spaces of functions. 
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Spline kernels. In Eq. (flOl ). we may chose, with q = 2: 



ki2ixi, Xj) 



1 

/ 

min{|xi|, |x^|}^(3max{|xj|, |x-|} — mm{|xj|, |x^|})/6, if XjX^ ^ 
0, otherwise, 



leading to tensor products of one-dimensional cubic spline kernels dWahbal. 1 199d : iGul . |2002|) . This 
kernel has the advantage of (a) being parameter free and (b) explicitly starting with line ar features 
and e ssentially provides a convexification of multivariate additive regression splines (IFriedman 



1991 



). Note that it may be more efficient here to use natural spUnes in the estimation method (|Wahbal. 



19901) than using kernel matrices. 



Hermite kernels. We can start from the following identity, valid for a < 1 a nd from which the 
decomposition of the Gaussian kernel in Eq. ([TT]) may be obtained ( Szegol. 1981 ): 



oo 



Hj{xi)Hj{x'i) = (1 - a2)-V2exp 



2a{xi - x[Y (xf + {x[Y)a 



+ 



1 + a 



We can then define a sequence of kernel which also starts with linear kernels: 



kio{Xi, Xj) 
kij (xj , Xj) 

kiq{xi, Xj) 



Hoix)Ho{x') = 1 



H,{x)Hjix')forje{l,...,q-l} 



j=q 



Most kernels that we consider in th i s sect ion (except the polynomial kernels) are universal ker- 
nels ( Micchelli et al. , 20061 : Steinwart . 2002), that is, on a compact set of MP, their reproducing 
kernel Hilbert space is dense in L'^{MP). This is the basis for the universal consistency results in 
Section 15.31 Moreover, some kernels such as the spline and Hermite kernels explicitly include the 
linear kernels inside their decomposition: in this situation, the sparse decomposition will start with 
linear features. In Section 15. 3[ we briefly study the universality of the kernel decompositions that 
we consider. 



3.4 Kernels or Features? 

In this paper, we emphasize the kernel view, i.e., we assume we are given a positive definite ker- 
nel (and thus a feature space) and we explore it using -norms. Alternatively, we could use the 
feature view, i.e., we would assume that we have a large structured set of features that we try to 
select from; however, the techniques developed in this paper assume that (a) each feature might 
be infinite-dimensional and (b) that we can sum all the local kernels efficiently (see in particu- 
lar Section 14.21). Following the ker nel view thus seems slightly more natural, but by no means 
necessary — see IJenatton et al.l (120091) for a more general "feature view" of the problem. 
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In order to apply our optimization techniques in the feature view, as shown in Section |4j we 
simply need a specific upper bound on the kernel to be able to be computed efficiently. More 

precisely, we need to be able to compute Yjw&{t) (Z]i>gA(«))nD(t) '^v) for all t e V , or an 

upper bound thereof, for appropriate weights (see Section |42l for further details). 

3.5 Graph-Based Structured Regularization 

Given / G HiiGy ^^'^ natural Hilbertian norm ||/|| is defined thi-ough = X^^gy IIA'P- 
Penalizing with this norm is efficient because summing all kernels is assumed feasible in poly- 
nomial time and we can bring to bear the usual kernel machinery; however, it does not lead to sparse 
solutions, where many /„ will be exactly equal to zero, which we try to achieve in this paper. 

We use the DAG to limit the set of active patterns to certain configurations, i.e., sets which are 
equal to their hulls, or equivalenty sets which contain all ancestors of their elements. If we were 
using a regularizer such as ||/^|| we would get sparse solutions, but the set of active kernels 

would be scattered throughout the graph and would not lead to optimization algorithms which are 
sub-linear in the number of vertices \ V\. 

All sets which are equal to their hull can be obtained by removing all the descendants of certain 
vertices. Indeed, the hull of a set / is characterized by the set of v, such that T){v) C P, i.e., such 
that all descendants of v are in the complement I'^ of I: 

hull(I) ={veV, V){v) C 

Thus, if we try to estimate a set I such that hull(I) = /, we thus need to determine which v ^ V 
are such that D(t;) C P. In our context, we are hence looking at selecting vertices f G F for which 
/d(d) = {fw)w&{v) = 0- We thus consider the following structured block ^i-norm defined on 
JF = jFi X • • • X JPp as 

where (d^)„gy are strictly positive weights. We assume that for all vertices but the sources of the 
DAG, we have = ^depth{?;) ^j^j^ /3 > 1, where depth(f ) is the depth of node v, i.e., the length of 
the smallest path to the sources. We denote by dr S (0, 1] the common weights to all sources. Other 



weigh ts could be considered, in particular-, weights inside the blocks D{v) (see, e.g. lJenatton et al 



20091 ). or weights that lead to penalties closer to the Lasso (i.e., /3 < 1), for which the effect of the 
DAG would be weaker. Note that when the DAG has no edges, we get back the usual block ^i-norm 
with uniform weights dr, and thus, the results presented in this paper (in particular the algorithm 
presented in Section 14.41 and non-asymptotic analysis presented in Section 15.2b can be applied to 
multiple kernel learning. 

Penalizing by such a norm will indeed impose that some of the vectors /j^^^j G P^^gj^^^j 
are exactly zero, and we show in Section [5TT] th at these are the only patterns we might get. We thus 
consider the following minimization problertlj: 

2 



mm 



i=l ^ v£V ^ ^v&V ^ 



'Following iBach et aL l( l2004iJ) andSectionl2l we consider the square of the norm, which does not change the regular- 
ization properties, but allow simple links with multiple kernel learning. 
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Figure 5: Directed acyclic graph of subsets of size 4: (left) a vertex (dark blue) with its ancestors 
(light blue), (right) a vertex (dark red) with its descendants (light red). By zeroing out weight vectors 
associated with descendants of several nodes, we always obtained a set of non-zero weights which 
contains all of its own ancestors (i.e., the set of non-zero weights is equal to its hull). 



Our no rm is a Hilbert space instantiation of the hierarchical norms recently introduced by lZhao et al. 



(|2009r) . If all Hilbert spaces are finite dimensional, our particular choice of norms corresponds to an 
"£i-norm of ^2-norms". While with uni-dimensional groups or kernels, the -n orm of ^np-norms " 
allows an efficient path algorithm for the square loss and when the DAG is a tree ( Zhao et al. . 2009h . 
this is not possible any more with groups of size larger than one, or when the DAG is not a tree (see 



Szafranski et al.L l2008l. for examples on two-layer hierarchies). In Section IH we propose a novel 
algorithm to solve the associated optimization problem in polynomial time in the number of selected 
groups or kernels, for all group sizes, DAGs and losses. Moreover, in Section [51 we show under 
which conditions a solution to the problem in Eq. ([T3] ) consistently estimates the hull of the sparsity 
pattern. 



4 Optimization 

In this section, we give optimality conditions for the problems in Eq. (fTSl ). as well as optimization 
algorithms with polynomial time complexity in the number of selected kernels. In simulations, 
we consider total numbers of kernels up to 4?^^, and thus such efficient algorithms that can take 
advantage of the sparsity of solutions are essential to the success of hierarchical multiple kernel 
learning (HKL). 



4.1 Reformulation in terms of Multiple Kernel Learning 

FoUowing lRakotomamonjy et al. I (l2008b . we can simply derive an equivalent formulation of Eq. ([T3] ). 
Using Cauchy-Schwarz inequality, we have that for all rj G M.^ such that J2veV ^v''lv ^ 1, a varia- 
tional formulation of 0(/)^ defined in Eq. ([12]) : 



dv\\f-D{v)\ 



II/d(i>)IP 



rjv 



fui II ) 
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with equality if and only if for all w G V r]^ = d^^\\fDiv)\\ {'Ewev dw\\fD{w)\\) ^ = '^^ ^(/f 
We associate to the vector i] G M^, the vector C, G such that 

Vu; G y, C«,(r?)"' = ■ (14) 

t;gA(«)) 

We use the natural convention that if rj^, is equal to zero, then Cw>(^) is equal to zero for all de- 
scendants w of V. We let denote H = {tj ^ R^, St^ev ^ 1} set of allowed t] and 
Z = {C{^), rj G if} the set of all associated C,{r]) for rj ^ H. The set i7 and Z are in bijection, and 
we can interchangeably use ?] G if or the corresponding C,{r]) G Z. Note that Z is in general not 
convex (unless the DAG is a tree, see Proposition |9] in Appendix lA. lb . and if C G Z, then ^ (■„ 
for all w G D(7;), i.e., weights of descendant kernels ai^e always smaller, which is consistent with 
the known fact that kernels should always be selected after all their ancestors (see Section ISTT] for a 
precise statement). 

The problem in Eq. (fT3l) is thus equivalent to 



1 

min min — > j 



i;GV ^ w&V 



From Section 121 we know that at the optimum, fw = Cw{'>]) XlILi (^i^wixi) G T^, where a G M'^ 
are the dual parameters associated with the single kernel learning problem in Proposition \T\ with 
kernel matrix Y.w(^v Cw{v)Kw 

Thus, the solution is entirely determined by a G M" and rj ^ H C. (and its corresponding 
C,{r]) G Z). We also associate to a and r] the corresponding functions fw, w G V, and optimal 
constant b, for which we can check optimality conditions. More precisely, we have (see proof in 
Appendix IA.4I ): 

Proposition 2 (Dual problem for HKL) The convex optimization problem in Eq. f [7?l) has the fol- 
lowing dual problem: 

1 A 

max i/jii—nXai) max (y^(rj)a^ Ky^a. (16) 

QeM", i,Ta=o 2 v&H ^ 

Moreover, at optimality, G V, = Cu>(^) Ya=i cti^w{xi) and b = b* (ZlioGy Cwi'n)J^wa), 
with T] attaining, given a, the maximum of^^^^y C,w{i])o^ K^ci. 

Proposition 3 (Optimality conditions for HKL) Let (a, r^) G x H, such that 1 = 0. Define 
functions f through Vw G V,fw = Cwiv) Yl7=i cti^wixi) and b = b* (XluieF Cw{i])Kwa) the 
corresponding constant term. The vector of functions f is optimal for Eq. ([13\l . if and only if: 

(a) given rj £ H, the vector a is optimal for the single kernel learning problem with kernel matrix 

(b) given a, rj £ H maximizes 

J2 {^v€A{w)Vv^) K^a = ^ Cw{ri)a^ K^a. (17) 

w&V w&V 
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Moreover, as shown in Appendix IA.4[ the total duality gap can be upperbounded as the sum of 
the two separate duality gaps for the two optimization problems, which will be useful in Section 14^2] 
for deriving sufficient conditions of optimality (see Appendix |A.4| for more details): 

where gap^gigi^^g corresponds to the duality gap of Eq. ([TT] ). Note that in the case of "flat" regular 
multi ple kernel learning, where the DAG has no edges , we o btain back usual optimality condi- 
tions (|Rakotomamonjy et all llOOSl : IPontil and Micchellil . l2005h . 



Following a common practice for convex sparse problems (iLee et all 120071 : iRoth and Fischeil. 

l2008h . we will try to solve a small problem where we assume we know the set of v such that ||/D(t,) || 
is equal to zero (Section |431 ). We then need (a) to check that variables in that set may indeed be left 
out of the solution, and (b) to propose variables to be added if the current set is not optimal. In the 
next section, we show that this can be done in polynomial time although the number of kernels to 
consider leaving out is exponential (Section 1421) . 

Note that an alternative approach would be to consider the regular multiple kernel learning 
problem with additional linear constraints ^ for all non-sources v . However, it would 
not lead to the analysis through sparsity-inducing norms outlined in Section [5] and might not lead to 
polynomial-time algorithms. 

4.2 Conditions for Global Optimality of Reduced Problem 

We consider a subset of F which is equal to its hull — as shown in Section [STTl those are the only 
possible active sets. We consider the optimal solution / of the reduced problem (on W), namely, 

min - V^fy./, + V 4||/D(„)nw^||V (19) 

with optimal primal variables /^y, dual variables a G M" and optimal pair (r/vKi Cw)- From these, 
we can construct a full solution / to the problem, as /i^c = Q, with ryvi^c = 0. That is, we keep a 
unchanged and add zeros to r^vy- 

We now consider necessary conditions and sufficient conditions for this augmented solution to 
be optimal with respect to the full problem in Eq. ( fT3l ). We denote by = Z^^giy ll/D(?;)niy II 
the optimal value of the norm for the reduced problem. 

Proposition 4 (Necessary optimality condition) If the reduced solution is optimal for the full prob- 
lem in Eq. di il ) and all kernels indexed by W are active, then we have: 



Kta ^ 

tgsources(VF'=) d\ 



max ' ^ 0(/)2. (20) 



Proposition 5 (Sufficient optimality condition) If 



E , ^. ^n{ff + 2e/X, (21) 



then the total duality gap in Eq. M8^ is less than e. 
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The proof is fairly technical and can be found in Appendix IA.51 this result constitutes the main tech- 
nical result of the paper: it essentially allows to design an algorithm for solving a large optimization 
problem over exponentially many dimensions in polynomial time. Note that when the DAG has no 
edges, we get back regular" conditions for unstructured MKL — for which Eq. (l20l ) is equivalent to 
Eq. (EB for e = 0. 

The necessary condition in Eq. (l20l ) does not cause any computational problems as the number 
of sources of W^, i.e., the cardinal of sources (VF^^), is upper-bounded by \W\ times the maximum 
out-degree of the DAG. 

However, the sufficient condition in Eq. (|2TI) requires to sum over all descendants of the active 
kernels, which is impossible without special structure (namely exactly being able to compute that 
sum or an upperbound thereof). Here, we need to bring to bear the specific structure of the full 
kernel k. In the context of directed grids we consider in this paper, if dy can also be decomposed 
as a product, then X^^gA(«;)nD(t) ^^^^ factorized, and we can compute the sum over all 

u G D(t) in Mnear time in p. Moreover, we can cache the sums 

= X] (Si)eA(u.)nD{t) Kw 

in order to save running time in the active set algorithm presented in Section 14.41 Finally, in the 
context of directed grids, many of these kernels are either constant across iterations, or change 
slightly; that is, they are product of sums, where most of the sums are constant across iterations, 
and thus computing a new cached kernel can be considered of complexity O(n^), independent of 
the DAG and of W. 



4.3 Dual Optimization for Reduced or Small Problems 



In this section, we consider solving Eq. (1131 ) for DAGs V (or active set W) of small cardinality, 
i.e., for (very) small problems or for the reduced problems obtained from the algorithm presented 
in Figure [6]from Section l44l 

When kernels ky,v , have low-dimensional feature spaces, either by design (e.g., rank one if 
each node of the graph corresponds to a single feature), or after a low-r ank decomposition such as a 
singular value decompo sition or an incomplete Cholesky factorization ( Fine and Scheinberg . 200 ll : 
Bach and Jordanl. 120051) . we may use a "primal representation" and solve the problem in E q. ([T3]) 
using generic optimization toolboxes adapted to conic constraints (see, e.g. JGrant and Boydl . l2008i) . 
With high-dimensional feature spaces, in order to reuse existing optimized supervised learning code 
and use high-dimensional kernel s, it is preferable to use a "dual optimization". Namely, we fol- 
low |R^otomamonjy_eLa]J(|2008|), and consider for C, ^ Z, the function 



B{C) = GiKiO) 



1 

mill — y 



yi,^{fv,'^v{xi)) + b 

vev 



+ 



X 



E 



which is the optimal value of the single kernel learning problem with kernel matrix Ylwev CwKw 
Solving Eq. ([T5] ) is equivalent to minimizing B{({ri)) with respect to rj £ H. 

If the Fenchel conjugate of the loss is strictly convex (i.e., squai^e loss, logistic loss, Hiiber loss, 
2-norm support vector regression), t hen the function B is differen tiable — because the dual problem 
in Eq. ^ has a unique solution a ( Bonnans and Shapirol. 2000 ). When the Fenchel conjugate is 
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not strictly convex, a ridge (i.e., positive diagonal matrix ) may be added to the kernel matri ces, 
which has the exact effect of smoothing the loss — see, e.g.. lLemarechal and Sagastizaball (119971) for 
more details on relationships between smoothing and adding strongly convex functions to the dual 
objective function. 

Moreover, the function rj ^ is differentiable on (M^)^, but not at any points rj such that 
one r/u is equal to zero. Thus, the function rj B[({{1 — e)rj + pp^d"^)] , where d"^ is the vector 
with elements is differentiable if e > 0, and its derivatives can simply be obtained from the 
chain rule. In simulations, we use e = 10~^; note that adding this term is equivalent to smoothing 
the norm (i.e., make it differentiable), while retaining its sparsity-inducing properties (i.e., 
some of the optimal rj will still be exactly zero). 

We can then use the same projected gradient descent strategy as lRakotomamonjy et al.l(l2008h to 
minimize it. The overall complexity of the algorithm is then proportional to 0(|y|n^) — to form the 
kernel matrices — added to the complexity of solving a single kernel learning prob lem — typically be- 
tween O(n^) and Oiv?), using proper kernel classification/regressio n algorithms (IVishwanathan et al 



20031: Loosli et al.L 120051) . Note that we could follow the approach of lChapelle and Rakotomamonjy 



(|2008h and consider second-order methods for optimizing with respect to i]. 



4.4 Kernel Search Algorithm 



We now present the detaile d algorithm which extends the search algorithm of iLee et al.l (120071) 



and iRoth and Fischeii (120081) . Note that the kernel matrices are never all needed explicitly, i.e., we 
only need them (a) explicitly to solve the small problems (but we need only a few of those) and (b) 
implicitly to compute the necessary condition in Eq. (l20l ) and the sufficient condition in Eq. (|2TI ). 
which requires to sum over all kernels which are not selected, as shown in Section l4!2l 

The algorithm works in two phases: first the (local) necessary condition is used to check op- 
timality of the solution and add variables: when those are added, the augmented reduced problem 
must include the new variable into the active set. Once the necessary condition is fulfilled, we use 
the sufficient condition, which essentially sums over all non selected kernels and makes sure that if 
some information is present further away in the graph, it will indeed be selected. See Figured for 
detailfl 

The algorithm presented in Figure |6] will stop either when the duality gap is less than 2e or when 
the maximal number of kernels Q has been reached. That is, our algorithm does not always yield a 
solution which is provably approximately optimal. In practice, when the weights increase with 
the depth of v in the DAG (which we use in simulations), the provably small duality gap generally 
occurs before we reach a problem larger than Q (however, we cannot make sharp statements). Note 
that some of the iterations only increase the size of the active sets to check the sufficient condition 
for optimality. Forgetting those would not change the solution as we add kernels with zero weights: 
however, in this case, we would not be able to actually certify that we have an 2e-optimal solution 
(see Figure |7] for an example of these two situations). Note that because of potential overfitting 
issues, settings of the regularization parameter A with solutions having more than n active kernels 
are likely to have low predictive performance. Therefore, we may expect the algorithm to be useful 
in practice with moderate values of Q. 



*Matlab/C code for least-squares regression and logistic regression may be downloaded from the author's website. 
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Input: Kernel matrices G R"^", weights dy, v £ V, maximal gap £, 

maximal number of kernels Q. 

Algorithm: 

1. Initialization: active set W = 0, cache kernel matrices K^, w G sources(M^'^) 

2. Compute (a, rj) solutions of Eq. (fT9l ). obtained using Section l43] (with gap e) 

3. While necessary condition in Eq. (l20l ) is not satisfied and ^ Q 

a. Add violating kernel in sources(VF'^) to W 

b. Compute (a, r]) solutions of Eq. ([T9l ). obtained using Section l43] (with gap e) 

c. Update cached kernel matrices K^, w G sources(iy^) 

4. While sufficient condition in Eq. (|2TI) is not satisfied and \ W\ ^ Q 

a. Add violating kernel in sources (1^^) to W 

b. Compute (a, r]) solutions of Eq. ([T9l ). obtained using Section 1431 (with gap e) 

c. Update cached kernel matrices K^, w G sources(VF'^) 
Output: W, a, rj, constant term b 

Figure 6: Kernel search algorithm for hierarchical kernel learning. The algorithm stops either when 
the duality gap is provably less than 2e, either when the maximum number of active kernels has 
been achieved; in the latter case, the algorithm may or may not have reached a 2e-optimal solution 
(i.e., a solution with duality gap less than 2e). 



Running-time complexity. Let D be the maximum out-degree (number of children) in the graph, 
K be the complexity of evaluating the sum in the sufficient condition in Eq. (|2TI) (which usually 
takes constant time), and R = \W\ the number of selected kernels (the number is the size of the 
acti ve set W). Assuming Oinf) for the single ker nel learning problem, which is conservative (see, 
e.g. Vishwanathan et al. . 20031 : Loosli et al. . 2005 , for some approaches), solving all reduced prob- 
lems has complexity 0{Rn^). Computing all cached matrices has complexity 0{Kn? x RD) 
and computing all necessary/sufficient conditions has complexity 0(n^ x R^D). Thus, the to- 
tal complexity is 0{Rn^ + kv?RD + v?R^D). Thus, in the case of the directed p-grid, we get 
0{Rn^ + n?'R^p). Note that the kernel search algorithm is also an efficient algorithm for unstruc- 
tured MKL, for which we have complexity 0{Rv? + v?R^p). Note that gains could be made in 
terms of scaling wi th respect to n by using better kernel rnachine codes with complexity between 
0(^2) and 0{rfi) dvishwanathan et all . I2OO3I : lUoosli et all. boOSh . Note that while the algorithm 
has polynomial complexity, some work is still needed to make it scalable for more than a few 
hundreds variables, in particular- because of the memory requirements of 0{Rpn?). In order to 
sav e storing requirements for the cached kernel rnatric es, low-rank decompositions might be use- 
ful (IFine and Scheinbergl . 1200 ll : iBach and Jordanl. l2005h . 



5 Theoretical Analysis in High-Dimensional Settings 

In this section, we consider the consistency of kernel selection for the norm ^{f) defined in Sec- 
tion [3] In particular, we show formally in Section [5?T] that the active set is always equal to its hull, 
and provide in Section [5^ conditions under which the hull is consistently estimated in low and high- 
dimensional settings, where the cardinality of V may be large compared to the number of observa- 
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>6-^<>-^<>-^ 



>o-^<>-^ 



>k A A A 



K>-^<>-^<>-^ 



A >^ A A A 



Figure 7: Example of active sets for the kernel search algorithms: (left) first phase, when checking 
necessary conditions, the dark blue nodes (x) are the active kernels (non-zero t]), and the red + are 
the sources of the complement, which may be added at the next iteration; (right) second phase, when 
checking sufficient conditions, the dark blue nodes (x) are the active kernels (non-zero r]), the light 
blue nodes (x) are the kernels with zero weights but are here just to check optimality conditions, 
and the red nodes (+) are the sources of the complement, which may be added at the next iteration. 



tions. Throughout this section, we denote by / any minimizer of Eq. ( fT3] ) and W = {v e V, fy 0} 
the set of selected kernels. 



5.1 Allowed Patterns 



We now show that under certain assumptions any solution of Eq. ([T3] ) will have a nonzero pattern 
wh ich is equal to i t s hull , i.e., the set W = {v ^ V, 0} must be such that W = U«,gi4/ ^{w) 



see 



Jenatton et all (|2009l) for a more general result with overlapping groups without the DAG struc- 



ture and potentially low-rank kernels: 

Theorem 6 (Allowed patterns) Assume that all kernel matrices are invertible. Then the set of 
zeros W of any solution f ofEq. di il ) is equal to its hull. 



Proof Since the dual problem in Eq. (fT6l) has a strictly convex objective function on the hyperplane 
In = 0, the minimum in a G M" is unique. Moreover, we must have a 7^ as soon as the loss 
functions Lpi are not all identical. Since = C^ct^-f^wO for some C, ^ Z, and all K^^a > 

(by invertibility of Ku, and a^l„ = 0), we get the desired result, from the sparsity pattern of the 
vector Q G M^, which is always equal to its hull. ■ 



As shown above, the sparsity pattern of the solution of Eq. (1131 ) will be equal to its hull, and thus 
we can only hope to obtain consistency of the hull of the pattern, which we consider in the next 
sections. In Section 15^ we provide a sufficient condition for optimality, whose weak form tends to 
be also necessary f or consistent estimation o f the hull; these results extend the one for t he Lasso and 
the gro up Lasso ( Zhao and Yu . 2006 : Zou . 20061 : Yuan and Lin . 2007 : Wainwright . 2009; Bach . 
2008ah . 
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5.2 Hull Consistency Condition 

For simplicity, we consider the square loss f o r regre ssion and leave out other losses presented in 
Section im for future work. Following BachI ( 2008a ). we consider a random design setting where 
the pairs {xi,yi) G X x y are sampled from independent and identical distributions. We make 
the following assumptions on the DAG, the weights of the norm and the underlying joint distribu- 
tion of ($„(X))„gy and Y. These assumptions rely on covariance operators, which are the tools 
of choice for anal yzing supervi s ed and unsupervised learning techniques with reproducing kernel 
Hilbert spaces (see lBachl.E008al : iFukumizu et al.l.l2007l : lHarchaoui et allboOSL for a introduction to 
the main concepts which are used in this paper). We let denote S the joint covariance operator for 
the kernel k{x,y) defined by blocks corresponding to the decomposition indexed by V. We make 
the following assumptions: 

(AO) Weights of the DAG: Each of the num(y) strongly connected components of V has a unique 
source; the weights of the sources are equal to dr € (0, 1], while all other weights are equal 
to dy = ^dcpth(ii) ^jjj^ /3 > 1. The maximum out-degree (number of children) of the DAG is 
less than deg{V) — 1. 

(Al) Sparse non-linear model: E(y|X) = '}2wew{^wi-^) + ^ ^i'^^ W C y, ftj, G T^, w G W, 
and b G M; the conditional distribution of Y\X is Gaussian with variance cr^ > 0. The set 
W is equal to its hull, and for each w G W, fD(tu)nw / (i-C-> the hull of the non zero 
functions is actually W). 

(A2) Uniformly bounded inputs: for all v ^ V, \\^y(X)\\ ^ 1 almost surely, i.e., ky{X, X) ^ 1. 

(A3) Compacity and invertibility of the correlation operator on the relevant variables: The joint 
correlation operator C of {^{xy))y^v (defined with appropriate blocks C^w) is such that 
Cww is compact and invertible (with smallest eigenvalue k = Amin(C'ww) > 0). 

(A4) Smoothness of predictors: For each w G W, there exists h^„ G such that = HwvMw 
and \\h.w\\ ^ 1. 

(A5) Root-summability of eigenvalues of covariance operators: For each w G W, the sum of the 
square roots of the eigenvalues of is less than a constant Ci/2- 

When the Hilbert spaces all have finite dimensions, covariance operators reduce to covariance 
matrices, and Assumption (A3) reduces to the invertibility of the coiTclation matrix Cww (as it is 
always compact) and thus of the cov arianc e matrix 5]ww> while (A4) and (AS) are always satisfied. 
These assumptions are discussed by iBachI (|2008ah in the context of multiple kernel learning, which 
is essentially our framework with a trivial DAG with no edges (and as many connected components 
as kern els). Note however that Assumption (A4) is slightly strong e r than the one used by BachI 
teOOSah and that we derive here non asymptotic results, while iBachI (l2008ah was considering only 
asymptotic results. 

For K a subset of V, we denote by Qk I/k) = y^^.^^j^ rf^JI fn(7.)ny I L the norm reduced to 



the functions in K and by ill- its dual norm (IBoyd and Vandenberg 



)(,,,) nprW, me norm reuuceu l o 
M I2OO3I : iRockafellail . Il970l) . 



defined as ^^igK) = maxQj^rf^^iigK, Jk)- We consider sw G {^v)v&w, defined through 



Vu; G W, 



E 



dJ\{] 



D(v) I 
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When the DAG h as no edges, i .e., for the regular group Lasso, we get back similar quantities than the 
ones obtained by Bachl l 2008a ): if in addition, the feature spaces are all u ni-dimensional, we get the 
vector of signs of the relevant variables, recov ering the Lasso conditions dZhao and Yul . l2006l : IZou , 
2006l : lYuan and Linll2007uWainwrighti . l2009l ). The following theorem shows that if the consistency 
condition in Eq. (I22l ) is satisfied, then we can upperbound the probability of incorrect hull selection 
(see proof in Appendix iBl): 



Theorem 7 (Sufficient condition for hull consistency) Assume (AO-5) and 



s; 1 



(22) 



with T] > 0; let v = miiit^jgw || ^''^^?>{^vv)'d{w)^'D{w)\\ and oj = rt{t)d^ Let 



l{V) 



41og(2num(y)) 41ogdeg(y) 



(1-/3-1)^ 



+ 



{log I3f 



Choose fi = XQ{{)dr € 
upper-bounded by: 



exp 



^172 ) cjll/2|W|V2 

/in 



The probability of incorrect hull selection is 



2 

fi n 

^ ) + exp ( - C2 



io-^\W\ 



+ exp - C3 



(23) 



where ci, C2, C3 are positive monomials in k, v, rj and C^J^. 

The previous theorem is the main theoretical contribution of this paper. It is a non-asymptotic 
result which we comment on in the next paragraphs. The proof relies on novel concentration in- 
equalities for empirical covariance operators and for structured norms, which may be useful in other 
settings (see results in Appendices IB . 2[ IB3] and lB4l) . Note that the last theorem i s not a consequence 
of similar results for flat rn ultiple kernel learning or group Lasso ( Bach . 2008a : Nardi and Rinaldol. 
Lounici et al. . 2009h . because the groups that we consider are overlapping. Moreover, the 
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last theorem shows that we can indeed estimate the conect hull of the sparsity pattern if the suffi- 
cient condition is satisfied. In particular, if we can make the groups such that the between-group 
correlation is as small as possible, we can ensure correct hull selection. 



Low-dimensional settings. When the DAG is assumed fixed (or in fact only the number of con- 
nected components num(F) and the maximum out-degree deg(y)) and n tends to +00, the prob- 
ability of incorrect hull selection tends to zero as soon as An^/^ tends to +00 and A tends to zero, 
and the convergence is exponentially fast in An. 



High-dimensional settings. When the DAG is lai^ge compared to n, then, the previous theo- 
rem leads to a consistent estimation of the hull, if the interval defining [i is not empty, i.e., n ^ 
4<T27(l/)a;ii|W|''q2. Since 7(y) = 0(log(num(y)) + log(deg(y))), this implies that we may 
have correct hull selection in situations where n = 0(log(num(y)) + log(deg(y))). We may 
thus have an exponential number of connected components and an exponential out-degree, with no 
constraints on the maximum depth of the DAG (it could thus be infinite). 

Here, similar scalings could be obtained with a weighted ^i-norm (with the same weights 
^depth(i,). however, such a weighted Lasso might select kernels which are far from the roor and 
would not be amenable to an efficient active set algorithm. 
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Multiple kernel learning (group Lasso). In this situation, we have a DAG with p connected 
components (one for each kernel), and zero out-degree (i.e., deg(y) = 1), leading to ^{V) = 
0((log jp)^/^), a classical n on-asym ptotic resi i lt in the unstructured settin gs for finite-dimensional 
groups ( Nardi and Rinaldo . 12008: Wainwright . 20091 : Lounici et al.l 20091) . but novel for the multi- 
ple kernel learning framework, where groups are infinite-dimensional Hilbert spaces. Note that the 
proof techniques would be much simpler and the result sharper in terms of power of |W| and to with 
finite-dimensional groups and with the assumption of invertibility of Sww and/or fixed design as- 
sump tions. Finally, Theorem|7]also applies for a modified version of the elastic net dZou and Hastiel. 
20051), where the ^2 -norm is added to the sum of block £1 norm — by considering a single node with 



the null kernel connected to all other kernels. 



Non linear variable selection. For the power set and the directed grids that we consider for non- 
linear variable selection in Section we have num(y) = 1 and deg{V) = p where p is the 
number of variables, and thus jiV) = O(logp) = 0(log log \V\), i.e., we may have exponentially 
many variables to choose non-linearly from, or a doubly exponential number of kernels to select 
from. 



Trade-off for weight /3. Intuitively, since the weight on the norm ||/d(i;)|| is equal to ^dcpth(i))^ 
the greater the f3 the stronger the prior towards selecting nodes close to the sources. However, if /? 
is too large, the prior might be too strong to allow selecting nodes away from the sources. 

This can be illustrated in the bound provided in Theorem|7] The constant 7(y) is a decreasing 
function of /3, and thus having a large f3, i.e., a large penalty on the deep vertices, we decrease 
the lower bound of allowed regularization parameters /U and thus increase the probability of correct 
hull selection (fai^ away vertices are more likely to be left out). However, since 0(f) is a rapidly 
increasing function of (3, the upper bound decreases, i.e., if we penalize too much, we would start 
losing some of the deeper relevant kernels. Finally, it is worth noting that if the constant f3 tend 
to infinity slowly with n, then we could always consistently estimate the depth of the hull, i.e., the 
optimal interaction complexity. Detailed results are the subject of ongoing work. 



Results on estimation accuracy and predictive performance. In this paper, we have focused 
on the simpler results of hull selection consistency, which allow simple assumptions. It is how- 
ever of clear interes t of fo llowing the Lasso work on estimation accuracy and predictive perfor- 
mance (IBickel et all. 120090 and extend it to our structured setting. In particular, the rates of con- 
vergence should also depend on the cardinal of the active set | W| and not on the cardinality of the 
DAG \V\. 



Enhancing consistency condition. The sufficient condition in Eq. ((22)) states that low correla- 
tion between relevant and irrelevant feature spaces leads to good model selection. As opposed to 
unstructured situations, such low correlation may be enhanced with proper hierarchical whitening 
of the data, i.e., for all f G F, we may project {^v{xi))i=i,...,n to the orthogonal of all ancestor 
vectors {^w{xi))i=i,...,n, w G A(i;). This does not change the representation power of our method 
but simply enhances its statistical consistency. 

Moreover, Assumption (A3) is usually met for all the kernel decompositions presented in Sec- 
tion l3.2[ except the all-subset Gaussian kernel (because each feature space of each node contains the 
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feature spaces associated with its parents). However, by the whitening procedure outUned above, 
similar results than Theorem |7] might be obtained. Besides, if the original variables used to define 
the kernel decompositions presented in Section [T2] are independent, then the consistency condition 
in Eq. (l22l) is always met except for the all-subset Gaussian kernel; again, a pre-whitening procedure 
might solve the problem in this case. 



Necessary consistency condition. We also have a necessary condition which is a weak form of 
the s ufficient cond ition in Eq. (l22l) — the proof follows closely the one for the unstructured case 
from lBachl(l2008ah : 



Proposition 8 (Necessary condition for iiull consistency) Assume (Al-3) and V is fixed, with n 
tending to +00. If there is a sequence of regularization parameters A such that both the prediction 
function and the hull of the active kernels is consistently estimated, then we have 

0;^.[Diag(I]i/^)w=Cw=wCwVw] ^1. (24) 



The conditions in Eq. (1221 ) and Eq. (1241 ) make use of the dual norm, but we can loosen them using 
lower and upper bounds on these dual norms: some are computable in polynomial time, like the 
ones used for the active set algorithm presented in Section l44l and more detailed in Appendix IB. 7 1 
However, we can obtain simpler bounds which require to look over the entire DAG; we obtain by 
lowerbounding ||/D{i>) || by ||/^,|| and upperbounding it by Y^waDiv) Wf^W in the definition of VL{f), 
for g e !F: 

\\9w\\ ^ ^* / \ ^ \\9w\\ 

max -^^ ^ ''w^lfiw^j ^ max 



The lower and upper bounds are equal when the DAG is trivial (no edges), and we get back the 
usual weighted norm max^gw= ^^4^. 



5.3 Universal Consistency 



In this section, we briefly discuss the universal consistency properties of our method when used for 
non-lineai- variable selection: do the kernel decompositions presented in Section Yhl\ allow the esti- 
mation of arbitrary functions? The main rationale behind using all subsets of variables rather than 
only singletons is that most non-linear functions r nay not be expresse d as a sum o f functions which 
depend only on one variable — what regular MKL (IBach et al.ll2004al) and SPAM (IRavikumar et al 



would use. All subsets ai^e thus required to allow universal consistency, i.e., to be able to 
approach any possible predictor function. 

Our norm ri(f ) is equivalent to a weighted Hilbertian norm, i.e.: 



Therefore, the usual RKHS balls associated to the universal kernels we prese nt in Section 13.21 
are contained in the b all of our norms, hence we obtain universal consistency (ISteinwartl. |2002| ; 
Micchelli et al.l l2006r) in low-dimensional settings when p is small. A more detailed and refined 
analysis that takes into account the sparsity of the decomposition and convergence rates is out of the 
scope of this paper, in particular for the different regimes for p, q and n. 
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6 Simulations 



In this section, we report simulation experiments on synthetic datasets and datasets from the UCI 
repository. Our goals here ai^e (a) to compare various kernel-based approaches to least-squares 
regression from the same kernel, (b) to compare the various kernel decompositions presented in 
Section [3^ within our HKL framework, and (c) to compare predic tive performa nce with non-kemel- 
based methods — more simulations may be found in earlier work ( Bach . 2008^ ). 



6.1 Compared Methods 

In this section, we consider various nonparametric methods for non-linear predictions. Some are 
based on the kernel decompositions defined in Section [3^ Non-kernel based methods were chosen 
among methods with some form of variable selection capabilities. All these methods were used 
with two loops of 10-fold cross-validation to select regularization parameters and hyperparameters 
(in particular (3). All results are averaged over 10 replications (medians, upper and lower quartiles 
are reported). 



Hierarchical kernel learning (HKL). We use the algorithm presented in Section l4!4l with the ker- 
nel decompositions presented in Section [J!2l i.e., Hermite polynomials ("Hermite"), spline kernels 
("spline") and all-subset Gaussian kernels ("Gaussian"). 



Multiple kernel learning (MKL). We use the algorithm presented in Section l44l with the kernel 
decompositions presented in Section [3^ but limited to kernels of depth one, which corresponds to 
sparse generalized additive models. 



Constrained forward selection (greedy). Given a kernel decomposition with rank one kernels, 
we consider a forwai'd selection approach that satisfies the same constraint that we impose in our 
convex framework. 



Single kernel learning (L2). When using the full decomposition (which is equivalent to summing 
all kernels or penalizing by an ^2-norm) we can use regular single kernel learning. 



Gener alized Lasso (Glasso). Given the same kernel matrix as in the previous method, RothI 



(12004) considers predictors of the form Y17=i CKj^i(^) ^i)' with the regularization by the £i-norm of 
a instead of Ka for the regular single kernel learning problem. 



Multivariate additive splines (MARS). This method of FriedmanI ( 1991 ) is the closest in spirit to 



the one presented in this paper: it builds in a forward greedy way multivariate piecewise polynomial 
expansions. Note however, that in MARS, a node is added only after one of its parents (and not all, 
like in HKL). We use the R package with standard hyperparameter settings. 



Regression tree s (CART). We cons ider regular decision trees for regression using the standard R 
implementation (IBreiman et all. 1 19841) with standard hyperparameter settings. 
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Figure 8: Comparison of non-linear regression methods (mean squared error vs. dimension of 
problem (in log scale). (Top left) comparison of greeedy, I2 and i\ (HKL) methods on the same 
Hermite kernel decomposition. (Top right) comparison of several kernel decompositions for HKL. 
(Bottom left) comparison with other kernel-based methods. (Bottom right) comparison with non- 
kemel-based methods. 



B oosted reg r ession trees (boosting). 

of iFriedmatJ (l200lh . 



We use the R "gbm" package which implements the method 



G aussian processes with automa tic relevance determinations (GP-ARD). We use the code 
of Rasmussen and Williams which learns widths for each variable within a Gaussian ker- 

nel, using a Bayesian model selection criterion (i.e., without using cross-validation). Note that 
HKL, with the all-subset Gaussian decomposition, does not search explictly for A in the kernel 
exp(— (x — x')^ A[x — x')), but instead considers a large set of particular values of A and finds a 
linear combination of the corresponding kernel. 



6.2 Synthetic Examples 

We generated synthetic data as follows: we generate a covariance matrix from a Wishart distribution 
of dimension p and with 2p degrees of freedom. It is then normalized to unit diagonal and n 
datapoints are then sampled i.i.d. from a Gaussian distribution with zero mean and this covariance 
matrix. We then consider the non-linear function /(X) = ^Y^j=iJ^\^j^i^ which takes all 

cross products of the first r variables. The output Y is then equal to ^QC) plus some Gaussian noise 
with known signal-to-noise ratio. 
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Figure 9: Comparison of non-linear regression methods (mean squared eiTor vs. dimension of 
problem (in log scale). (Top left) comparison of greeedy, £2 and ii (HKL) methods on the same 
Hermite kernel decomposition. (Top right) comparison of several kernel decompositions for HKL. 
(Bottom left) comparison with other kernel-based methods. (Bottom right) comparison with other 
non-kernel-based methods. 



Results ai^e reported in Figure [8] On the top left plot, we compai^e different strategies for linear 
regression, showing that in this constrained scenario where the generating model is sparse, £1- 
regularization based methods outperform other methods (forward selection and ridge regression). 
On the top right plot, we compare different kernel decompositions: as should be expected, the 
Hermite and spline decompositions (which contains exactly the generating polynomial) performs 
best. On the bottom left plot, we compare several kernel-based methods on the same spline kernel, 
showing that when sparsity is expected, using sparse methods is indeed advantageous. Finally, on 
the bottom right plot, we compare to non-kernel based methods, showing that ours is more robust 
to increasing input dimensions p. It is also worth noting the instabilities of the greedy methods such 
as MARS or "greedy", which sometimes makes wrong choices at the start of the procedure, leading 
to low performance. 



6.3 UCI Datasets 



We p erform simulations on the "pumadyn" datasets from the UCI repository (|Blake and Merzl. 



19981) . These datasets are obtained from realistic simulations of the dynamics of a robot arm, and 



have different strengths of non-linearities (fh: fairly linear, high noise; nh: non-linear, high noise) 
and two numbers (8 and 32) of input variables. 
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Results are reported in Figure |9l On the top left plot, we compare different strategies for lin- 
ear regression with n = 1024 observations: with moderately non-linear problems (32fh, 8fh), all 
performances are similar, while for non-linear problems (32nh, 8nh), HKL outperforms other meth- 
ods (forwai^d selection and ridge regression). On the top right plot, we compai^e different kernel 
decompositions: here, no decomposition includes the generating model, and therefore, none clearly 
outperforms the other ones. On the bottom left plot, we compare several kernel-based methods on 
the same spline kernel: it is interesting to note that for moderately linear problems, MKL performs 
well as expected, but not anymore for highly non-linear problems. 

Finally, on the bottom right plot, we compare to non-kernel based methods: while boosting 
methods and CART ai^e clearly performing worse, HKL, MARS and Gaussian processes perform 
better, with a significant advantage to MARS and Gaussian processes for the dataset "32nh". There 
are several explanations regarding the worse performance of HKL that could lead to interesting de- 
velopments for improved performance: first, HKL relies on estimating a regularization parameter 
by cross-validation, while both MARS and GP-ARD rely on automatic model selection through fre- 
quentist or Bayesian procedures, and it is thus of clear interest to consider methods to automatically 
tune the regularization parameter for sparse methods such as HKL. Moreover, the problem is not 
really high-dimensional as n is much larger than p, and our regularized method has a certain amount 
of bias that the other methods don't have; this is a classical problem of £i -regularized problems, and 
this could be fixed by non-regularized estimation on the selected variables. 



7 Conclusion 



We have shown how to perform hierarchical multiple kernel learning (HKL) in polynomial time in 
the number of selected kernels. This framework may be applied to many positive definite kernels and 
we have focused on kernel decompositions for non-linear variable selection: in this setting, we can 
both select which variables should enter and the corresponding degrees of interaction complexity. 
We have proposed an active set algorithm as well a theoretical analysis that suggests that we can 
still perform non-linear variable selection from a number of variables which is exponential in the 
number of observations. 

Our framework can be extended in multiple ways: first, this paper shows that trying to use 
£i-type penalties may be advantageous inside the feature space. That is, one may take the opposite 
directions than usual kernel-based methods and look inside the feature spaces with sparsity-inducing 
norms instead of building feature spaces of ever increasing dimensio ns. We are currently investi- 
gating applications to other k ernels, such as the pyramid match kernels ( Grauman and Darrelj 2007 : 

Cuturi and Fukumizul . l2006l) . string kernels, and graph kernels (see, e.g.. lShawe-Taylor and Cristianini 



20041) . Moreover, theoretical and algorithmic connections with the recent work of iHuang et al, 
(I2OO9I) on general structured sparsity and greedy methods could be made. 



Moreover, we have considered in this paper a specific instance of block ^1 -norms with over- 
lapping groups, i.e., groups organized in a hierarchy, but some of the te chniques and framew orks 
presented here can be extended to more general overlapping structures ( Jenatton et al. , 20091) . for 
DAGs or more general graphs; it would also be interesting to consider non discrete hierarchical 
structures with a partial order, such as positive definite matrices. 

Finally, we hope to make connections wi th other uses of sparsity-inducing norms, i n particular 
in signal processing, for compressed sensing ( Baraniuk . 2007 ; Candes and Wakin . 2008h . dictionary 
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learni ng (lOlshausen and Field 
2007h . 



19970 and sparse principal component analysis (Id'Aspremont et al. 



A Proofs of Optimization Results 

In this first appendix, we give proofs of all results related to the optimization problems. 
A.l Set of Weights for Trees 

We prove that the set of weights (, i.e., Z, is itself convex when the DAG is a tree. We conjecture 
that the converse is true as well. 

Proposition 9 IfV is a tree, the set Z = {C(r/) € M^, r/ G M^, Y.v(^v ^Vv ^ 1} 

IS convex. 

Proof When the DAG is a ti"ee (i.e., when each vertex has at most one pai^ent and there is a single 
source r), then, we have for all v which is not the source of the DAG (i.e., for which there is exactly 
one parent), — C,^'^ = —rj~^. This implies that the constraint ^ is equivalent to (^t, ^ for 
all leaves v, and for all v which is not a source, ^^^{i;) ^ Cv, with equality possible only when they 
are both equal to zero. 

Moreover, for the source r, C,r = rjr- The final constraint "Y^^^y Vvd^ ^ 1, may then be written 
as T^v^r dl .-L V-1 + Crdr ^ 1, that is, Y , dl (C^ + , ) + Crdl < 1, which is a convex 

constraint (IBoyd and Vandenberghd . 120031) . ■ 



A.2 Proof of Proposition U 

We introduce auxiliary variables Ui = (/, ^{xi)) + b and consider the Lagrangian: 

C{u, f,b,a) = -Y, ^^{^^) + 2 11/11' + ^ E - ^(^^)) - 

i=l i=l 

Minimizing with respect to the primal variable u leads to the term — ^ ipi{—nXai); minimiz- 
ing with respect to / leads to the term —^a^Ka and to the expression of / as a function of a, and 
minimizing with respect to b leads to the constraint ija = XlILi '^^ ~ 

A.3 Preliminary Propositions 

We will use the following simple result, which implies that each component Cwif]) ^ concave 
function of r] (as the minimum of linear functions of r/): 

Lemma 10 Let a E (M!^)™'. The minimum ofY^Y=i '^j^j subject to x and Yl^=i ~ ^ 
equal to [Y^^'^i c^r^) is attained at Xi = a^^ (Y1T=i '^j"^) 



31 



Proof The result is a consequence of applying Cauchy-Schwartz inequality, applied to vectors with 

1/2 —1/2 

components x^a- and a- . Note that when some of the Oj are equal to zero, then the minimum 
is zero, with optimal Xj being zero whenever Oj 7^ 0. ■ 

The following proposition derives the dual of the problem in r], i.e., the dual of Eq. (fTTl) : 

Proposition 11 Let L = {k G m1|^^^, K:p^{wYw = Oand'\/w G V, J2v&A{w) ^-"w = 1}- The 
following convex optimization problems are dual to each other, and there is no duality gap : 

min < maxd"^ 7^ K^^oi^Kwa}, (25) 

max (w{r])a^ Kwa. (26) 

Proof We have the Lagrangian C{A, K,r]) = A + X^^gy rj^ {T.w&d{v) K-lw^^'^^^a - Adl^ , with 
r/ ^ 0, which, using Lemma [TOl can be minimized in closed form with respect to A, to obtain the 
constraints J2vev Vvdf, = 1 and with respect to k G L. We thus get 



max ( ^ Cw{'n)Kw \ a. 



a, 



Given rj, the optimal value for k has a specific structure (using Lemma[TOl for all w G V): (a) if for 
all V G A(u;), ry^, > 0, then k^w = Cw^v^ for '^^ ^ A(it;), (b) if there exists v G A('u;) such that 
rjv = 0, then for all v G A(i(;) such that rjy > 0, we must have k^w = 0. ■ 



A.4 Proof of Proposition |3] 

We consider the following function of r] £ H and a G M" (such that l^a = 0): 

1 " A / ~ \ 

F{r], ") = -- X] V'i(-'^AQi) - -a^ i ^ Cw{v)K^w j a. 

4 = 1 ^IDSV ^ 

This function is convex in r] (because of Lemma [TOl) and concave in a; standard arguments (e.g., 
primal and dual strict feasibilities) show that there is no duality gap to the variational problems: 

inf sup F{r],a)= sup infF(r7,a). 
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We can decompose the duality gap, given a pair (ry, a) (with associated / and b) as: 



sup F{r],a')— inf F{'q\a) 




< - 






We thus get the desired upper bound from which Proposition [3] follows, as well as the upper bound 
on the duality gap in Eq. (fTSl ). 

A.5 Proof of Propositions |4] and |5] 

We assume that we know the optimal solution of a truncated problem where the entire set of decen- 
dants of some nodes have been removed. We let denote W the hull of the set of active variables. 
We now consider necessary conditions and sufficient conditions for this solution to be optimal with 
respect to the full problem. This will lead to Propositions |4] and |5] 

We first use Proposition [TT] to get a set of for {v, w) for the reduced problem; the goal 
here is to get necessary conditions by relaxing the dual problem in Eq. (1251 ). defining k G L and 
find an approximate solution, while for the sufficient condition, any candidate leads to a sufficient 
condition. It turns out that we will use the solution of the relaxed solution required for the necessary 
condition for the sufficient condition. 

Necessary condition. If we assume that all variables in W are active and the reduced set is optimal 
for the full problem, then any optimal k G L must be such that = if u G and w G W^, 
and we must have k^w = CwVv^ for t; G and w e D{v) nW (otherwise, rjw cannot be optimal 
for the reduced problem, as detailed in the proof of Proposition fTTT). We then let free for v, w 
in W^. Our goal is to find good candidates for those free dual parameters. 
We can lowerbound the sums by maxima: 
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which can be minimized in closed form with respect to k leading to k.^^ = (^X]i)'eA(«))nvV"= 
and, owing to Proposition [TT] to the following lower bound for max^g// YlweV Cwiv)'^'^ ^wC(- 

1.2 a^K^a \ \ 2 a^Ku,a , 
max { 6 , max -= — j > ^ max < 6 , max -= \ , (27) 



where 5^ = X]^g^y Cto(w)ci^-f^«)a = If the reduced solution is optimal we must have 

this lower bound smaller than 6"^, which leads to Eq. ( |20l ). Note that this necessary condition may 
also be obtained by considering the addition (alone) of any of the sources w G sources(iy^) and 
checking that they would not enter the active set. 

Sufficient condition. For sufficient conditions, we simply take the previous value obtained before 
for K, which leads to the following upperbound for max^g// Ylwev Cwi'n)'^'^ ^wCt- 



\ 2 sr^ K^a \ \ 2 sr^ K^^a } 
max <^ , max > -= > = max < , max > -= — ^ > , 



because for all v G W^, there exists t G sources (1^"^) such that v G D(t). We have moreover for 
all t G W, 

^ ^ ^ dv, 

t!eA(«))nVK'= veA{w)nD{t) 

leading to the upper bound: A = max 1 5 VaxtgsourccsCiy:) YltoeD{t) (E„6 1(1^0") <i^y' } ' 
in Eq. ([TSl l is thus less than X/2{A — 6'^), which leads to the desired result. 

A.6 Optimality Conditions for the Primal Formulation 

We now derive optimality conditions for the primal problem in Eq. (fT3] ). when the loss functions ipi 
are differentiable, which we will need in Appendix |Bj that is: 



where L{f,b) is the differentiable loss function. Following iBachI (l2008al) and Proposition |2l the 
solution may be found by solving a finite-dimensional problem, and thus usual notions of calculus 
may be used. 

Let / G = n»;GV ^^'^ G M, where / / 0, with W being the hull of the active functions 
(or groups). The directional derivative in the direction (A, r) G J-'^ x M is equal to 

{VfLif,b),A) + vM^\b)T + xn{f)(y2dJ-^^,A^+ V d,||AD(.)||Y 

and thus (/, b) if optimal if and ony if Vi,L{f, b) = (i.e., b is an optimal constant term) and if, 

with (5 = n{fy. 

ywGW,Vf^L{f,b) + X6( V ^1_V^ = 0, (28) 
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and VAiyc G 



E 



(Vj„L(/,6),A^) + A5 



lA 



D{^)l 



^0. 



(29) 



We can now define for K C V, Q kI/k) = y^i.^jy c^7)ll.fr)(7 . )nR-IL the norm reduced t o the func- 
tions in K and ilj^ its dual norm ( Boyd and Vandenberghe . 2003 : Rockafellar . 1970h . The last 
equation may be rewritten: f]^c(Vj,^L(/, 6)) ^ X6. Note that when regularizing by Ar2(/) = 



^ Ylvev dv\\fD{v) II instead of ^ | X^dsV II /d(i)) II ) » we have the same optimality condition with 
6 = 1. 



B Proof of Theorem III 

In this appendix, we provide the proof of Theorem |7] with several intermediate results. Following 
usual proof techniques from the Lasso literature, we will consider the optimization reduced to ker- 
nels/variables in W, and (a) show that the hull of the selected variables is indeed the hull of W 
(i.e., itself because we have assumed in (AO) that W is equal to its hull) with high probability, and 
(b) show that when the reduced solution is extended to with zeros, we have the optimal global 
solution of the problem with high probability. The main difficulties are to use bounds on the dual 
norms of our structured norms, and to deal with the infinite-dimensional group structure within a 
non-asymptotic analysis, which we deal with new concentration inequalities (Appendices IB.2I IB. 31 
and El). 



B.l Notations 

Let jj-v = Y^l=i '^vixi) G Tv be the empirical mean and /j^ = E$t,(X) G JT^ the population 
mean of and = ^Yll=ii^i'i^i) ~ M ® — j^w) be the empirical cross- 

covariance operator from to and = }iYll.=i^i{'^v{xi) — fiv) G for v,w G V, 
where ei = Ui — X]«;ew ^w{xi) — b is the i.i.d. Gaussian noise with mean zero and variance a^. 
By assumption (A2), we have tr ^ 1 and tr ^ 1 for all v & V , which implies that 
Amax(5]ww) ^ |W| and Amax(Sww) ^ |W|. 

All norms on vectors in Euclidean or Hilbertian spaces ai^e always the Euclidean or Hilbertian 
norms of the space the vector belongs to (which can always be inferred from context). However, 
we consider several norms on self-adjoint operators between Hilbert spaces. All our covaiiance 
operators are compact and can thus b e diagonalized in an Hilbertian basis, with a sequence of 
eigen values that tends to zero (see, e.g.. lBrezisLll980l : lBerlinet and Thomas-Agnanl . l2003l : IConwayi. 



19971) . The usual operator norm of a self-adjoint operator A is the eigenvalue of largest magnitude 
of A and is denoted by ||A||op: the Hilbert-Schmidt norm is the ^2-norm of eigenvalues, and is 
denoted by ||j4||hs! and is equal to the Frobenius norm in finite dimensions. Finally, the trace 
norm is equal to the £i-norm of eigenvalues, and is denoted by ||j4||tr. In Section Ib31 we provide 
novel non asymptotic results on the convergence of empirical covariance operators to the population 
covariance operators. 

6.2 Hoeffding's Inequality in Hilbert Spaces 

In this section, we prove the following proposition, which will be useful throughout this appendix: 
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Proposition 12 Let Xi, . . . , X„ be i.i.d. zero-mean random observations in the Hilbert space 7i, 
such that for alii, \\Xi\\ ^ 1 almost surely. Then, we have: 



^ i=l ^ 



exp 



(30) 



Proof We denote ^ = || ^ Y17=i If all Xi a re held fixed bu t one, then Z may only change by 
-. Thus, from Mc Diarmid's inequality (see, e.g.. iMassarti. l2003l Theorem 5.1, page 148), we have, 
for all t ^ 0: 

¥{Z - EZ ^ t) ^ exp{-nf/2). 
Moreover, using the Hilbertian structure of H: 



/ 1 " \ 1/2 

EZ < (EZ2)^/2 = — ^ K{X,,Xj) = n-^/2(E||Xif )V2 ^ 



n 



-1/2 



This leads to ¥{Z ^ n~'^/h + n^^/^) ^ exp{-t'^/2) for all t ^ 0, i.e., for all t ^ 1, ¥{Z ^ 
^ exp(-(t - 1)2/2). If t ^ 2, then {t - 1)^ ^ and thus P(Z ^ tn"^/'^) ^ 

exp(-t2/8) ^ 2exp(-nt2/8). Fort ^ 2, then the right hand side is greater than 2 exp (-1/2) > 1, 
and the bound in Eq. (l30l ) is trivial. ■ 



B.3 Concentration Inequalities for Covariance Operators 

We prove the following general proposition of concentration of empirical covariance operators for 
the Hilbert-Schmidt norm: 

Proposition 13 Let Xi,. . . ,Xn be i.i.d. random observations in a measurable space X, equipped 
with a reproducing kernel Hilbert space T with kernel k, such that k{Xi, Xi) ^ 1 almost surely. 
Let S and T, be the population and empirical covariance operators. We have, for all x ^ 0.' 



SIIhs ^ ^ 4 



exp 



X 
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Proof We first concentrate the mean, using Proposition [T2j since the data is universally bounded 
by 1: 



l/i — n\\ ^ t) ^ 2 exp 



nt'' 



The random variables (<I>(Xj) — /i) {^(Xi) — fi) are uniformly bounded by 1 in the Hilbert 
space of self-adjoint operators, equipped with the Hilbert-Schmidt norm. Thus, using Proposi- 
tion [121 we get 



n s 
S - - J^($(X,) -tl)^ ($(X,) - /X) ^ X 
^ i=l HS / 



^ X ^ 2 exp ( — 



nx 
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Thus, since S = i Y.7=li'^i^^)-^^MH^^)-^^)+i^^-f^)^i^^-^^)' and \\{n-fi)^{fi-fL)\\us 
Wfj, — we get: 

2 

- — / Tlx \ 

|5]-S||hs ^x) ^2exp - — +2exp 



nx 
16 



^ 4exp 



nx 

'32 J' 



as long as X ^ 2. When x > 2, the bound is trivial because — S||hs ^ x occurs with probability 
zero. ■ 

We now prove the following general proposition of concentration of empirical covariance oper- 
ators for the trace norm: 

Proposition 14 Let Xi,. . . ,Xn be i.i.d. random observations in a measurable space X, equipped 
with a reproducing kernel Hilbert space T with kernel k, such that k{Xi,Xi) ^ 1 almost surely. 
Let XI and S the population and empirical covariance operators. Assume that the eigenvalues of X) 
are root-summable with sum of square roots of eigenvalues equal to Ci/2- have, if x ^ 4Ci/2-' 

2 

PdlE - Elltr ^ 3exp ( " |^)- 

Proof It is shown by iHarchaoui et al. (l2008|)that 

E||S - S||tr ^ Ci/2n"^/2_ 
Thus, following the same reasoning as in the proof of Proposition [T2j we get 

pf S - i {^{Xi) -fi) ^ (C7i/2 + t)n-^A ^ exp(-tV2), 

and thus if t ^ 2C1/2, we have: 



1 " 

s - - J2{Hx,) - {^{Xi) - fi] 

n ^-^ 



1=1 



tr 



^ tn~^lA ^ exp(-tV8). 



We thus get, for x ^ 4Ci/2' 



P(||S-S||tr ^xn~^/2) ^exp(-xV32) + 2exp( j ^ 3exp(-xV32), 

as long as xn"^/^ ^ 2. If this is not true, the bound to be proved is trivial. 



B.4 Concentration Inequality for Least-squares Problems 

In this section, we prove a concentration r esult that can be applied to several problems i nvolv 



ing lea st-squares and covariance operators (|Harchaoui et all |2008|; iFukumizu et all. 120071 : iBach 
2008ah : 
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Proposition 15 Let Xi,. . . ,X„ be i.i.d. random observations in a measurable space X, equipped 
with a reproducing kernel Hilbert space J- with kernel k, such that k{Xi,Xi) ^ 1 almost surely. 
Let XI and S the population and empirical covariance operators. Assume that the eigenvalues of S 
are root-summable with sum of square roots of eigenvalues equal to C\j2- Let e be an independent 
Gaussian vector with zero mean and covariance matrix a^\. Define q = ^ Yli=i ^ii^i^i) ~ A)- 

We have, for all t ^ (Aa^n-^ A-i/^Ci/a + ||S - EUtrA"! Y^'^: 

P(||(S + Xiy^^'^qW ^ t\X) ^ exp(-ntV2o-2) 

Proof Given the input variables, || (S+AI)~^/^g|| is a Lipschitz-continuous function of the i.i.d. noise 
vector e, with Lipschitz constant n"^/^. Moreover, we have 



E 



S + AI)-i/2^|||x) < E(||(S + AI)-i/2g||2|X 



1/2 



crn 



-1/2 



trS(S + AI)- 



We now follow Harchaoui et al. (l2008h for bounding the empirical degrees of freedom: 

trS(S + AI)"^ -trS(I] + AI)-^ 
= Atr(S + AI)"i(S-S)(S + AI)~i 

^ A||S - I]||tr||(S + AI)-i||op||(S + AI)-^||op ^ A-i||S - Slltr. 
Moreover, we have: tr X;(S + AI)"^ X^^/'^Ci/2. This leads to: 



E 



A + — S||trA 



The fi nal bound is ob tained from concentration of Lipschitz-continuous functions of Gaussian vari- 
ables (lMassartl . l2003[) : 

P(||(S + AI)~^/2g|| ^ t\X) ^ exp(-ntV2o-2) 



as soon as > Aa'^n ^ 



A + IIS — S||trA 



B.5 Concentration Inequality for Irrelevant Variables 



In this section, we upperbound, using Gaussian concentration inequalities (IMassarti. |2003|), the tail- 
probability 

¥{n^4z] ^ t), 

where z = —qw + Sw<:w(Sww + D)^^q-w, for a given deterministic nonnegative diagonal 
matrix D. The vector z may be expressed as weighted sum of the components of the Gaussian 
vector e. In addition, r2^c[gw<=] is upperbounded by max^gw^ lbi«||c^«;^ ^ max^gw: lbi«||- 
Thus by concentration of Lipschitz-continuous function s of mult i variat e standai^d random variables 
(we have a 



1^-1/2. 



Lipschitz function of e), we have (IMassarti . 120031) : 



F[n^4z];^t+E{n^4z]\x)\x] ^exp 
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For all w G W^, given (xi, . . . n^/^cr~^z^ G is normally distributed with covariance 
operator which has largest eigenvalue less than one. We now decompose W*^ by values of '■ by 
assumption, may take value d,. or a power of /3 (we let denote V the set of values of d^, w € V). 
We get (where x denotes all input observations): 



n^/^cr-^E I max -ii^lx ) ^ n^/^^^ V E I max 



2 



2 



2 



We thus get \ 



X 



^ exp (— Y )' ^iid if we use t ^ 2A, we get 



t 



Note that we have used the expectation of the maximum of q norms of Gaussian vectors is less than 
2(log(2q'))^/^ times the maximum of the expectation of the norms. 



Upper bound on A. The cardinal of depth ^ (k) is less than num(F) deg{V)'', thus, since P > 1, 

2 



^ = E^l°g(2|depth-i(A:)|)i/2 
2 



k>0 



^ E ^[(l°g(2num(F))i/2 + (A;logdeg(y))V2] 

Moreover, we have, by splitting the sum at (21og/3)~^, and using the fact that after the split, the 
function x i-^ /3~^x^/2 decreasing: 

(21og/3)-i oo 

2^r'A:i/2 ^ 2^/3-'=A:i/2 ^2 Yl r'^^'^' + 2 Yl 

fc^O fc^l fc=l fc=(21og/3)-i 



2 



"+00 



H-oo 



21og/3)3/2 7o 

^ ..3,, +2(log/3)-3/M e-V/^dx, 

(21og/?)'^/2 



1 2 
^ 7; TWTi^ (1 + r(3/2)) ^ T, ^T^' where r(-) is the Gamma function. 

(logp)'^/^ (logp)'^/"^ 
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This leads to ^ ^ (log(2num(y))^/^ + (log deg(V))^/^ (log^)^/^ ^^^^ the expression for 

7(y) in Theorem |7] 

B.6 Error of the Reduced Solution 

We have the following loss function (optimized with respect to the constant term 6 G M) 

^(/) = ^(/-f,S(/-f))-((Z,/-f). 



Following BachI ( 2008a ) and Nardi and Rinaldol ( 2008h . we consider the reduced problem on W, 



miuf^jr^ /w<:=o -^(/) + ^^wifw), with non unique solution / (since Sww is not invertible in 
general). The goal here is to show that / and f are close enough so that for all w G W, /£)(^) 7^ 0; 
this will implies that the hull of the active set of / is indeed W. 

As opposed to the Lasso case, we also need to consider /w the minimum of /w ^ ^(/w) + 

\ II f IP 

f zl/i>ew ^" . which con^esponds to the local quadratic approximation of the norm around fw, 
where 

rrf . \m{v)\\ 

Moreover, we consider the corresponding noiseless version fw of /w (the solution for e = 0). We 
will compute error bounds ||fw — fw||» ||/w — fw|| and ||/w — /w||, which will provide an upper 
bound on ||/w — fw|| (see Proposition [T9l). In particular, once we have ||/w — fw|| ^vjl, then 
we must have ||/d(io) || > for all G W and thus the hull of selected kernels is indeed W. 



Lemma 16 We have: 



,, , ,, „2\ !^(f)^|W|V2 

|f"W — fwll ^ ( A + ySww — 5]ww||opdr ) ■ (32) 



Proof The function f is defined as, with D = Diag((^^^I), 

fw = (Sww ~l" AZ?) ^Swwfw ~ fw ~ A(Sw^w^ + A-D) ^Z)fw 

Thus, we have 

(Sww + AI?) ^(Sww ~ ^ww)(^ww + AI?) ^£>fw 



iw — I 



^ A 



+ A||(Sww + AZ))-iZ?f. 



w|| • 

2 



We can now upper bound || (Sww + AL')-^L'fw|| ^ ||hw||K"^ ||^||op ^ \^\''^'^ K~^Vt{ifv 
ll/w - fwll ^ ( A + ||Sww - Swwllopll-D "^llop) ||(Sww + AI?)~"^L'fw|| 



< ^A + ||SwW - Swwllop^r ^) ^2^^ |W|^/^K ^. 



We have used moreover the following identities: 



C-^ ^ dl and C;^^ = ^^(f) E II - 2 



40 



which leads to ||-D~^||op ^ d'"^ and ||i:>||op ^ ^^(f)^!/"^. ■ 
Lemma 17 We have: 

ll/w - fwll ^ A-i/2rf-i||(g^^ + \D)-^/^q^\\. (33) 

Proof The difference / — f is equal to, with D = Diag(C^^I), /w — fw = (Sww + AZ))~^gw- 
Thus, ll/w - fwll sS A-^/^llD^^/^llop X ||(Sww + \D)-'^l'^q^^\\, which leads to the desired 
result. ■ 



Lemma 18 Assume ||/w - fw|| ^ A ^ |W|(i^^ and ||Sww - ^wwllop ^ TfrWT" 
ll/w - /wll ^ min ■ 



16|W 

have: 

■96|W|3/2||fw-/w||0(f)2 



v^ndl ' 8|W|3/2 ' 4 

Proof We consider the ball of radius 5 ^ m.m.{-^^j^^ , |} around /w, i-C-, Bs{fw) = {/w S 

•^w, ll/w — /wll ^ f^}- Since 6 ^ i//4 and ||/w — fw|| ^ ^^74, then in the ball Bsifw), we 
have for all w G W, ||/D(«;)nwll ^ 2^/2- On the ball Bs{fw)> the function Lw : /w i-^- -Z>(/w) 
is twice differentiable with Hessian Sww> while the function H-w '■ fw > ^^^w(/w)^ is also 
twice differentiable. The function H-w is the square of a sum of differentiable convex terms; a short 
calculation shows that the Hessian is greater than the sum of the functions times the sums of the 
Hessians. Keeping only the Hessians corresponding to the (assumed unique) sources of each of the 
connected components of W, we obtain the lower bound (which still depends on /): 



ygw 

dfwdfw 



(/w) > rf.rf^w(/w)Diag 



1 



Wfc 



Al-\\fc\\-'fcf^) 



cgc(w) 



where C(W) are the connected components of W. We can now use Lemma l20l to find a lower 
bound on the Hessian of the objective function Lw + A//w on the ball Bs{iw)- with A = 
Amm[((/c, ^CDfD))c,Deciw)]' we obtain the lower bound 

„ A . (\d^^] A\d^. 
B = — mm < 1, ' — 



lop 



3 I ' |W| j 3|W| ' 

because ilw(/w)||/cir^ ^ dr, Amax(Sww) =^ |W|, and A ^ \W\d~'^. 

We have moreover on the ball Bs{fw) (on which ||/w|| ^ 2||fw|| ^ 2|W|^/^), 

A ^ Amm[((/c, ScD/D))c,DeC(W)] - ™ax ||/c ||^ ||SwW - 5]ww| 

cgc(w) 

^ K min ||5]^/2/^„|p - 4|W|||5]ww - Swwilop 
CGC w) ^ 

^ K min V llSi/^f^f -2k|W|^/2^|W| -4|W|||Sww-Sww||op 

w&C 
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because we have assumed that that 2k| W|^/^5|W| ^ v^nj^ and 4|W|||I]ww — Swwilop ^ 

We can now show that /w and /w are close, which is a simple consequence of the lower 
bound B on the Hessian. Indeed, the gradient of the objective at /^y (applied to z) is equal to 

vew 

^ 2\\\z\\ |W|i/2max|C-C»(/^)-i| 

■ug W 

because IC^ - ^ '"'^J^" «S ||fw - /w||^. If we choose 

A|W|i/2fe^O(f)2 96|W|3/2||fw - /w||J^(f)' 



5>2- 



2 3TWT 



then the minimum of the reduced cost function must occur within the ball Bs{fw)- H 

We can now combine the four previous lemma into the following proposition: 
Proposition 19 We have: 

|/w-fw||^ Ah -5 h— ^ — IKSww + A-D) ' qw\\- (34) 

\ J K,V dr 

Assume moreover ||/w — fw|| ^ J^/4, A ^ |W|(i~^ a«J ||Sww — 5]ww||op ^ TB1W|'' f^^^^' 

llf f IKIIf f 11^ • f 96|W|3/2||/^-f^||0(f)2 ^2 

||fw - /wll ^ ll/w - fwll + mm I , ^ j- (35) 

B.7 Global Optimality of the Reduced Solution 

We now prove, that the padded solution of the reduced problem / is indeed optimal for the full 
problem if we have the following inequalities (with /i = \Q.{i)dr and uj = Q.{i)d~^): 

llEww - SwwII =^ = O L~^\W\~^/A (36) 

II WW WWII ioi^(f)|w|V2 V ' ' ; 



lEwW - SwwII n/, = /^'O L;-3/2|w|-l/2 (37) 

I WW ^ iori(f)|w|V2 ^ V ' ' ; ^ ' 

ifw - /wll ^ = (.-^/^iwrv2) (38) 

' •'Wll ^ 4ori(f)3|w|i/2 ^ V ' ' y ^ ' 

,3 



|fw - /wll min |i.r?/5, ^1^1 = O [u^-^) (39) 
20f^(f)2|W| 



Ai/2 ^ ^-7^'^'^' ^ i.e., /z^/^ = O L-3/2|wpi/2) (40) 

2orj f 2 w 1/2 V ' ' y 
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n;^c[-qw- + Sw-w(Sww + ^Dy\w] ^ Xn{f)ri/5 = 0{fid;^) (41) 
ll/w - fwlllKSww + AI))-i/2gw|| ^ = /iO (^-2) , (42) 

Following Appendix lA. 61 since ||/w — fw|| ^ i^/2, the hull is indeed selected, and /w satisfies 
the local optimality condition 

Sww(/w - fw) — qw + AOw(/w)sw = 0, 
Sww(/w - fw) - qw + ADiag(C^)/w = 0, 

where sw is defined as (following the definition of s) and = Cifw)'- 

sn.= ( J2 ^-11 /dm 11"')/- = Q'mr'L, Vu; G W. 

^i;eA{u>) ^ 

This allows us to give a "closed form" solution (not really closed form because it depends on 
which itself depends on /): 

/w - fw = (Sww + ADiag(C~^))~^(gw - ADiag(C;;;^)fw)- 

We essentially replace C by and check the optimality conditions from Appendix IA.6I That is, we 
consider the event r2^c[VL(/)w<=] ^ Ar2(/). We use the following inequality, with the notations 

g^^vc = Biag{Y:^y)w'^Cw'^wC^^^'Diag{'Sl!^n{^)-'^Cw^)w^w and D = Diag(C^)w, D = 
Diag(C^)w: 

r2^c[VL(/)w=] = ^^w4~'?W': + Sw':w(/w - fw)] 

= r^w^ [~'?W': + 5]w'=w(Sww + Al)) "^(gw — Al)fw)] 
^ ^W'=[~'?W': + Sw':w(Sww + AD) ^qw] + Ar2^c[gw^] 

+ XDy'Dfw)] 
+ \D)-^D{w - S,^w(Sww + XDy^Dfw] 
+r2^c [Sw<:w(5^ww + AZ)) """gw — (Sww + AD) ^qw] 
^ r2^c[— Q'w': + 5]w':w(Sww + AD) "^gw] + A17^c[gw^] 
+A(^ + D + C7). 

We will bound the last three terms A, B and C by rL{i)rj/b, bound the difference |ri(f) - fl(/)| ^ 
ryO(f)/5 (which is implied by Eq. (l39l )) and use the assumption il^c[gw^] ^ 1 — and use the 
bound in Eq. (|4TI) to bound Q^^n [—qwrc + Swcw(I!ww + AD)~^gw] ^ AS7(f)?//5. Note that we 
have the bound f^^c[gw':] ^ max^.gw^ obtained by lower bounding ||/d(d) || by in the 
definition of w<: ■ 
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Bounding B. We have: 

R = i;u,w(Sww + Al))^^l)fw) — 5]^w(Sww + AI?)^^I)fw) 

(Eww + XD)-\-Eww + XD- Sww + AL>))((Sww + XDy^Dfw)) 

||-R|| ^ ||S„,w — 5]^w||op||-D||op|W|"^/^K; ^ 

+A"l/2||^-l/2||^^P||^^|^|l/2^-l /'lISww - Swwilop + AIIL* - D\\op 



+ \\D - D\\op\W\^^^ 

+A-i/2^-i||2J)(f)2zy-2|^|i/2K-i (^||Sww - Swwilop + A4J7(f)2zy"3||j _ f 

+ |W|l/240(f)2l/-3||/-f||, 

which leads to an upper bound B ^ (ijr^||i?||. The constraints imposed by Eq. (l36l) . Eq. (|37] ). 
Eq. ([38]) and Eq. ([391) imply that 5 < 17(f)r//5. 

Bounding ^. We cons ider the terrn S wcwf^ww + Xp)~^p f\^). Because of the operator 
range conditions used by lBac i] (l2008ah and lFukumizu et all (120070 . we can write 



Diag(Si/^)Cww Diag(Sj/^)7 = Sww7 = Diag(S;^,„)hw, 
where ||7|| ^ ||h||. We thus have 

I]^w(5]ww + AZ))-iL'fw = 5]i/JC^wDiag(5:J/2)^(5]ww + AL')-iL»Diag(S„,)whw 

= ^U^C^w Diag(Si/2)^(s^^ + XD)-^^ww7 

= I]i/JC^wDiag(Si/2)w7 

-Si/JC„,w Diag(5]i/2)w(Sww + XD)-^XDj. 

We have moreover 

Si/JC^wC'^\y-DDiag(Si/^)whw = ^i/J^'iuwf^ww^ww Diag(Sj/^)7, 
which leads to an upper bound for A: 

A ^ Ai"^/=^Al/2||I)||l/2||^|| ^ k-3/2aV2||^||3/2|^|1/2 ^ 4^-3/2;,l/2^(f )3^-3 1 w| 1/2 _ 

The constraint imposed on Eq. (l40l ) implies that A ^ Q{{)r]/5. 

Bounding C. We consider, for w G W^: 

T = S^w(5^ww + AZ))~"'^gw — S^w(Sww + AD)~^gw 
= AS,u,w(Sww + XD)-^{D - £')(Sww + AD)-i(?w 
A-i||r|| 5^ A-i||L'-i||op||Z)-^||op||(Sww + AD)-i/2g^|| 

5^ 4A^id72j7(f)2z.-3||/w - fwlllKSww + XDr^/\w\\, 

leading to the bound C ^ (i~^A~^ ||T||. The constraint imposed on Eq. (|42l) implies that C ^ 
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B.8 Probability of Incorrect Hull Selection 

We now need to lower bound the probability of all events from Eq. (l36l) . Eq. (1371 ). Eq. (1381) . Eq. 
Eq. (|40l ). Eq. (|4TI ) and Eq. (I42l) . They can first be summed up as: 

||fw-/w|| 5^ O (//^w~^|W|-i/ 

^l s; 0(a;"3|W|-i) 
||Sww - Swwiltr O (^u^^/^lWl-i/^mii^/^ 
^w4~9w<: + Sw<=w(Sww + XDy^qw] Ar2(f)r//5 = 0{fid~^) 



From Proposition [19] in order to have ||fvv — /w|| ^ O (/i^/^w ^|W| ^/^), we need to have 
||fw - /wll ^ O {fi^/^Lo"'^\W\~^), i.e., ||(Sww + XDy^/^qwW ^ 0{h^/^uj~'^/^\W\~^), ^ = 
0(^V4^-4|W|-5/2) and ll^ww - Swwiltr = 0{fi^/^uj-^\W\-^/^). 

From Proposition [T5l in order to bound ||(Sww+AD)~^/^(?w||, we require ||I]ww— 5]ww||tr 
0(/ii/2^-V2|w|-3/2)_ We finally require the following bounds: 



II^WW - Swwiltr ^ O 

^W'^i^^W'^ + Sw^wC^^ww + AD) ^qw] ^ 0{fidj.^) 



We can now use Propositions [14] and [15] as well as Eq. ([3T1) to obtain the desired upper bounds on 
probabilities. 

B.9 Lower Bound on Minimal Eigenvalues 

We provide a lemma used earlier in Section [B31 

Lemma 20 Let Q be a symmetric matrix defined by blocks and (ui) a sequence of unit norm vectors 
adapted to the blocks defining Q. We have: 



Amin ( Q + Diag - UiuJ) 



Kdn[iujQijUj)ij] . J mirij^. 



iQ) 

Proof We consider the orthogonal complements Vi of Ui, we then have 

[ui, . . .,Up]^ {Q + Diag - UiuJ)]) [ui, ...,Up] = {ujQijUj)ij 
[Vi, VpV {Q + Diag [fXiil - UiuJ)] )[Vi,...,Vp] = {V^QijVj + 6i=j^iil\j 
[Vi,...,Vp]^ ((5 + Diag Uiu])]) [ui,...,Up] = {V^QijUj)ij. 

We can now consider Schur complements: the eigenvalue we want to lower-bound is greater than v 

if 1/ ^ \min[{u] QijUj)ij] and 

{V^QijVj + 6i=jfiiT)ij - {V^ QijUj)ij{{uJ QijUj)ij - viy^{uj QijVj)ij )>= ul 
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which is equivalent to 



{Vi QijVj)i,j + - {Vi QijUj)ij{ui QijUj)^ - QijVj)ij 

+ [Vi^ QijUj)i,j iujQijUj)~j - i{ujQijUj)ij - uT)~^ {ujQijVj)i,j > (43) 



If we assume that v ^ X^-^rXiul QijUj)i,j\l2, then the second term has spectral norm less than 



2i^Amax(Q) 



AminKWi QijUj)i,j) 



The result follows. 
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